Re: Contribution and committer guidelines

2019-01-29 Thread Sanjay Pujare
I too agree with the sentiments expressed here, and it's good to know there
are people who think like me.

In terms of examples, you can see the PR
https://github.com/apache/apex-core/pull/569, which has a long discussion
about failing tests and who has the onus to prove what, etc. In a recent PR
that abossert opened, Vlad threatened to -1 an idea that was not even
suggested (https://github.com/apache/apex-core/pull/607#discussion_r248498412).
As Amol mentioned, vetoes are not rare here, and I didn't have to wait too
long for an example to turn up.


On Tue, Jan 29, 2019 at 3:14 PM amol kekre  wrote:

> Justin,
> I agree with your thoughts. Vetoes are not rare in Apex. We are trying to
> figure out a way to get there.
>
> Amol
>
> On Tue, Jan 29, 2019 at 3:01 PM Justin Mclean 
> wrote:
>
> > Hi,
> >
> > If someone submits what you think is poor quality code, just point it out
> > to them and ask them to fix it, or even better fix it yourself to show
> > them what is expected. Vetoing something like that seems a little
> > heavy-handed and is not the best way to encourage community growth. It's
> > better to improve the quality of others' contributions rather than
> > blocking them from contributing. Vetoes in practice are very rare; how
> > many have actually occurred in this project? Wouldn't it be better to
> > focus on practical ways to get people involved and increase contribution
> > rather than hypothetical situations of when to veto a code change?
> >
> > Thanks,
> > Justin
>


Re: Contribution and committer guidelines

2019-01-25 Thread Sanjay Pujare
+1


On Fri, Jan 25, 2019 at 5:20 PM Pramod Immaneni 
wrote:

> Our contributor and committer guidelines haven't changed in a while. In
> light of the discussion that happened a few weeks ago, where a high commit
> threshold was cited as one of the factors discouraging submissions, I
> suggest we discuss some ideas and see if the guidelines should be updated.
>
> I have one. We pick some reasonable time period like a month after a PR is
> submitted. If the PR review process is still going on *and* there is a
> disagreement between the contributor and reviewer, we will look to see if
> the submission satisfies some acceptable criteria and if it does we accept
> it. We can discuss what those criteria should be in this thread.
>
> The basics should be met, such as code format, license, copyright, unit
> tests passing, functionality working, acceptable performance and resolving
> logical flaws identified during the review process. Beyond that, if there
> is a disagreement about code quality or refactoring depth between committer
> and contributor, or the contributor agrees but does not want to spend more
> time on it at that moment, we accept the submission and create a separate
> JIRA to track any future work. We can revisit the policy in the future once
> code submissions have picked up and do what's appropriate at that time.
>
> Thanks
>


Re: [DISCUSS] Time for attic?

2019-01-09 Thread Sanjay Pujare
Vlad,

Thanks for the clarification. So the threat of shutting down the project
(aka moving it to the attic) is being used as a way to energize the
community. That's one way to do it, and I guess it works in many cases...

We should also look into more positive ways of encouraging participation
and contribution. Hopefully it's not too late for that.

Sanjay

On Wed, Jan 9, 2019 at 2:01 PM Vlad Rozov  wrote:

> Hi Sanjay, long time, no see.
>
> This is my attempt to mobilize the community and see if we can revive some
> activity on the project. Note that the same discussion happened among PMC
> members 2 months ago and there were promises to contribute back to the
> project, with no new PRs being opened. Should you follow the e-mail thread
> thoroughly, you would see that the move to the attic was questioned by the
> Apache board, so it was not me who initiated the move. I simply made the
> community aware that Apex is subject to the move if it continues the way it
> was for the last 6-8 months. With the current activity and no commitment to
> make new contributions, Apex does belong in the attic.
>
> Thank you,
>
> Vlad
>
> > On Jan 9, 2019, at 12:27, Sanjay Pujare  wrote:
> >
> > Hi Vlad
> >
> > I have been watching this debate from the sidelines and just decided to
> > jump in.
> >
> > As an Apex PMC member you said "... I am responsible for maintaining the
> > correct state of the project...". But let's face it, you don't actually
> > have to do it just like you didn't have to make contributions, make
> > proposals or do all sorts of other things for keeping the project alive
> and
> > vibrant. Without doing absolutely anything, PMC members will continue to
> > remain PMC members which applies to you as well.
> >
> > So at this stage, you may do nothing, you may start mobilizing for making
> > contributions, you may stop being a member of PMC or you may start this
> > drive to move the project to the attic. May I know why you chose the last
> > option instead of one of the others? It will be good to know your answer
> > before discussing the details of people staying away or being
> discouraged.
> >
> > Sanjay
> >
> >
> >
>
>


Re: [DISCUSS] Time for attic?

2019-01-09 Thread Sanjay Pujare
Hi Vlad

I have been watching this debate from the sidelines and just decided to
jump in.

As an Apex PMC member you said "... I am responsible for maintaining the
correct state of the project...". But let's face it, you don't actually
have to do it just like you didn't have to make contributions, make
proposals or do all sorts of other things for keeping the project alive and
vibrant. Without doing absolutely anything, PMC members will continue to
remain PMC members which applies to you as well.

So at this stage, you may do nothing, you may start mobilizing for making
contributions, you may stop being a member of PMC or you may start this
drive to move the project to the attic. May I know why you chose the last
option instead of one of the others? It will be good to know your answer
before discussing the details of people staying away or being discouraged.

Sanjay



On Wed, Jan 9, 2019 at 10:50 AM Vlad Rozov  wrote:

> Amol,
>
> Without details about who decided to stay away and why, it is not very
> constructive and does not tell me whether I should continue or not. Again,
> I'd like to see what policies need to be changed that will bring more
> contributions and won't affect the quality of the code. If you have a
> concrete proposal, please post it here, so the community can decide whether
> to accept it or not.
>
> It is always the case that a contributor decides what to contribute. The
> goal of the e-mail thread is to see if there are a few contributors who
> plan to contribute in the near future and what they plan to contribute.
>
> Thank you,
>
> Vlad
>
> > On Jan 9, 2019, at 10:18, amol kekre  wrote:
> >
> > Vlad,
> > Would you want to continue to be involved in the project, even if this
> > involvement is itself causing community folks to stay away? If the issue
> > is cultural, things will not improve. Doing the same thing again and
> > expecting a different result will not work. Why not change the policies
> > to enforce a different culture, and then wait 6 months to see if things
> > change? With regards to listing features, that needs to be something that
> > the contributors should decide.
> >
> > Amol
> >
> >
> > On Wed, Jan 9, 2019 at 9:22 AM Vlad Rozov  wrote:
> >
> >> Remember that to vote -1 it is necessary to provide justification, so
> >> I'd like to see the justifications and the plan from those who do not
> >> want to move Apex to the attic. I am also not very happy that my past
> >> efforts will be placed in the attic, but let's face the reality. It is
> >> not that I don't want to be involved in the project, but as a PMC member
> >> I am responsible for maintaining the correct state of the project, and
> >> with the current level of contributions, IMO, it belongs in the attic.
> >>
> >> Thank you,
> >>
> >> Vlad
> >>
> >>> On Jan 9, 2019, at 09:02, Pramod Immaneni 
> >> wrote:
> >>>
> >>> What would be the purpose of such a vote? From the discussions it is
> >>> quite apparent that there is a significant, possibly majority, view
> >>> that the project shouldn't go to the attic. The same could be reported
> >>> to the board, can't it? Like I also said, if you or others don't like
> >>> where the project is at and feel it is a dead end, you don't have to
> >>> continue to be involved with the project, and that's your prerogative.
> >>> Let others who want to continue take it forward; why try to force your
> >>> will onto everyone?
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Jan 9, 2019 at 8:43 AM Vlad Rozov  wrote:
> >>>
>  Without concrete details of what will be committed (support for k8s,
>  hadoop 3.x, kafka 2.x, etc.) and what requirements in code submission
>  need to be relaxed (well-written Java code, consistent code style,
>  successful build with passing unit tests in CI, providing unit tests,
>  etc.), the statements below are way too vague. Note that I started this
>  e-mail thread with the intention to see what contributions the community
>  may expect. Without concrete details of the future contributions, I'll
>  submit a vote by the end of January.
> 
>  Thank you,
> 
>  Vlad
> 
> > On Jan 9, 2019, at 00:47, priyanka gugale  wrote:
> >
> > I do believe and know of some work done in private forks by people.
> > There could be a couple of reasons why it didn't go public. One could be
> > the high bar for code submission (I don't have references at hand, but
> > that's the general feeling amongst committers) and the other could be
> > lack of motivation.
> >
> > Let's try to put in some effort to revive the work, motivate committers,
> > and take hard decisions later if nothing works. A product like Apex /
> > Malhar definitely deserves to survive.
> >
> > -Priyanka
> >
> > On Wed, Jan 9, 2019 at 12:07 PM Atri Sharma  wrote:
> >
> >> The reason for a private fork was due to potential IP conflicts with
> >> my current organization. I am 

Re: [Proposal] Extension of the Apex configuration to add dependent jar files in runtime.

2018-02-02 Thread Sanjay Pujare
In cases where we have an "über" docker image containing support for
multiple execution environments, it might be useful for the Apex core to
infer what kind of execution environment to use for a particular invocation
(say, based on configuration values or environment variables), and in that
case the core will load the corresponding libraries. I think this kind of
flexibility or support would be difficult to achieve through plugins, hence
I think Sergey's proposal will be useful.

Sanjay


On Fri, Feb 2, 2018 at 11:18 AM, Sergey Golovko 
wrote:

> Unfortunately, moving the .apa file to a docker image cannot resolve all
> problems with the dependencies. If we assume an Apex application should be
> run in different execution environments, the application docker image must
> contain all possible execution environment dependencies.
>
> I think the better way is to assume that the original application docker
> image, like the current .apa file, should contain the application-specific
> dependencies only. And some smart client tool should create the executable
> application docker image from the original one and include the
> execution-specific environment dependencies in the target application
> docker image. It means that, in any case, a smart Apex client tool should
> have an interface to define different environment dependencies or
> combinations of different dimensions of the environment dependencies.
>
> Thanks,
> Sergey
>
>
> On Fri, Feb 2, 2018 at 10:23 AM, Thomas Weise  wrote:
>
> > The current dependencies are based on how Apex YARN client works. YARN
> > depends on a DFS implementation for deployment (not necessarily HDFS).
> >
> > I think a better way to look at this is to consider that instead of an
> .apa
> > file the application is a docker image, which would contain Apex and all
> > dependencies that the "StramClient"  today adds for YARN.
> >
> > In that world there would be no Apex CLI or Apex specific client.
> >
> > Thomas
> >
> >
> >
> > On Thu, Feb 1, 2018 at 5:57 PM, Sergey Golovko 
> > wrote:
> >
> > > I agree. It can be implemented with the usage of plugins. But if I need
> > > to enable and configure the plugin, I need to put this information into
> > > dt-site.xml. It means the plugin and its parameters must be documented,
> > > and the list of the added specific jars will be visible and available
> > > for updates to the end-user. The implementation via plugins is a more
> > > dynamic solution that is more convenient for the application
> > > developers. But I'm talking about the static configuration of the Apex
> > > build or installation, which relates more to the platform development.
> > >
> > > The current Apex core implementation has used a static, unchanged list
> > > of jars for a long time, because the Apex implementation still contains
> > > several basic static assumptions (for instance, the usage of YARN,
> > > HDFS, etc.). And the current Apex assumptions are hardcoded in the
> > > implementation. But if we are going to improve Apex and use Java
> > > interfaces in the generic Apex implementation, the current static
> > > approach in the Apex code of hardcoding a list of dependent jars will
> > > not work anymore. It will require a new solution to add/change jars in
> > > specific Apex builds/configurations. And I don't think the usage of
> > > plugins will be good for that.
> > >
> > > Thanks,
> > > Sergey
> > >
> > >
> > > On Thu, Feb 1, 2018 at 1:47 PM, Vlad Rozov  wrote:
> > >
> > > > There is a way to get the same end result by using plugins. It will
> be
> > > > good to understand why plugin can't be used and can they be extended
> to
> > > > provide the required functionality.
> > > >
> > > > Thank you,
> > > >
> > > > Vlad
> > > >
> > > >
> > > > On 1/29/18 15:14, Sergey Golovko wrote:
> > > >
> > > >> Hello All,
> > > >>
> > > >> In Apex there are two ways to deploy non-Hadoop jars to the deployed
> > > >> cluster.
> > > >>
> > > >> The first approach is static (hardcoded) and it is used by Apex
> > > >> platform developers only. There are several final static arrays of
> > > >> Java classes in StramClient.java that define which of the available
> > > >> jars should be included in the deployment for every Apex
> > > >> application.
> > > >>
> > > >> The second approach is to add the paths of all dependent jar files
> > > >> to the value of the attribute LIB_JARS. The end-user can set/update
> > > >> the value of the attribute LIB_JARS via dt-site.xml files, command
> > > >> line parameters, application properties and plugins. The usage of
> > > >> the attribute LIB_JARS is the official documented way for all Apex
> > > >> users to manage the deployment jars.
> > > >>
> > > >> But some of the dependent jars (not from the Apex core) can be
> common
> > > for
> > > >> all customer's applications for a specific installation and/or
> > execution
> > > >> 

Re: [Proposal] Switch to Java 8

2018-01-19 Thread Sanjay Pujare
Are they mutually exclusive? If not, "start supporting Java 9" doesn't have
to imply dropping Java 8 support. I think Vlad implied supporting both.

Sanjay

On Fri, Jan 19, 2018 at 10:56 AM, Pramod Immaneni 
wrote:

> Java 9 will shut people out, as 8 is not yet EOL. The majority of the
> installations I have seen are running 8, and people are moving from 7 to 8.
> Other projects in our space are also moving to 8.
>
> > On Jan 19, 2018, at 8:31 AM, Vlad Rozov  wrote:
> >
> > +1. It will be good to start supporting Java 9.
> >
> > Thank you,
> >
> > Vlad
> >
> > On 1/18/18 11:01, Hitesh Kapoor wrote:
> >> +1.
> >>
> >> --Hitesh
> >>
>
>


Re: [Proposal] Simulate setting for application launch

2017-12-21 Thread Sanjay Pujare
It is relatively easy to describe the justification for this change without
getting into the weeds and hairsplitting over words.

A DAG is built not only to launch an application but also to let a user
visualize and configure it. Currently "populateDAG" is the only method we
require application writers to implement and they implement it with the
goal of running the application. So it can use properties, configuration
and code that is really only needed if you want to "run" the DAG.

As mentioned above, a perfectly valid use case is that a platform allows a
user to construct a DAG, visualize it, and then attach configuration values
to various components in the DAG, save these values as some kind of a
"configuration package", and then at a future date run the DAG with this
setup. This is consistent with the view that construction of a pipeline and
execution of the pipeline are two separate phases and should be delineated
as such.
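
To make the distinction concrete, here is a minimal sketch of how an
application could branch on such a launch-context flag; the property name
"apex.launch.simulate" is purely hypothetical and only illustrates the idea:

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class MyApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Always construct the pipeline; visualization and configuration
    // tooling needs this part regardless of the launch mode.
    // ... add operators and streams here ...

    // Hypothetical flag: only touch external systems on a real launch.
    if (!conf.getBoolean("apex.launch.simulate", false)) {
      // e.g. record the application start in an external system
    }
  }
}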

If you understand and agree with the justification, we can work on improving
the original proposal.

Sanjay


On Thu, Dec 21, 2017 at 8:27 AM, Vlad Rozov <vro...@apache.org> wrote:

> "Sometimes" is not a use case. Config is not a context.
>
> Without concrete use cases the proposed change is not well justified.
> populateDAG() is supposed to populate DAG, not to record anything in an
> external system. It was a design goal for plugins.
>
> Thank you,
>
> Vlad
>
>
> On 12/20/17 02:23, Priyanka Gugale wrote:
>
>> +1
>> Sometimes this context is required. We shouldn't change any default
>> behaviour other than making this config available.
>>
>> -Priyanka
>>
>>
>>
>> On Wed, Dec 20, 2017 at 5:32 AM, Pramod Immaneni <pra...@datatorrent.com>
>> wrote:
>>
>>> The external system recording was just an example, not a specific use
>>> case. The idea is to provide comprehensive information to populateDAG as
>>> to the context it is being called under. It is akin to the test mode or
>>> simulate flag that you see with various utilities. The platform cannot
>>> control what populateDAG does even without this information; in the
>>> multiple calls that you mention, the application can return different
>>> DAGs depending on any external factor such as time of day or some
>>> external variable. This is merely to provide more context information in
>>> the config. It is up to the application to do what it wishes with it.
>>>
>>> On Tue, Dec 19, 2017 at 2:28 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>
>>>> -0.5: populateDAG() may be called by the platform as many times as it
>>>> needs (even if it calls it only once now to launch an application).
>>>> Passing different parameters to populateDAG() in simulate launch mode
>>>> and actual launch may lead to a different DAG being constructed for
>>>> those two modes. Can't the use case you described be handled by a
>>>> plugin?
>>>>
>>>> Thank you,
>>>>
>>>> Vlad
>>>>
>>>>
>>>> On 12/19/17 10:06, Sanjay Pujare wrote:
>>>>
>>>>> +1 although I prefer something that is more enforceable. So I like the
>>>>> idea of another method, but that introduces incompatibility, so maybe
>>>>> in 4.0?
>>>>>
>>>>> On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath <
>>>>> amberar...@yahoo.com.invalid> wrote:
>>>>>
>>>>>+1
>>>>>
>>>>>> Ram
>>>>>>   On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni <
>>>>>> pra...@datatorrent.com> wrote:
>>>>>>
>>>>>> I have a mini proposal. The command get-app-package-info runs the
>>>>>> populateDAG method of an application to construct the DAG but does not
>>>>>> actually launch the DAG. An application developer does not know in
>>>>>> which context the populateDAG is being called. For example, if they
>>>>>> are recording application starts in an external system from
>>>>>> populateDAG, they will have false entries there. This can be solved in
>>>>>> different ways such as introducing another method in
>>>>>> StreamingApplication or more parameters to populateDAG, but a
>>>>>> non-disruptive option would be to add a property in the configuration
>>>>>> object that is passed to populateDAG to indicate if it is
>>>>>> simulate/test mode or a real launch. An application developer can use
>>>>>> this property to take the appropriate actions.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>


Re: [Proposal] Simulate setting for application launch

2017-12-19 Thread Sanjay Pujare
+1 although I prefer something that is more enforceable. So I like the idea
of another method, but that introduces incompatibility, so maybe in 4.0?

On Tue, Dec 19, 2017 at 9:40 AM, Munagala Ramanath <
amberar...@yahoo.com.invalid> wrote:

>  +1
> Ram
> On Tuesday, December 19, 2017, 8:33:21 AM PST, Pramod Immaneni <
> pra...@datatorrent.com> wrote:
>
>  I have a mini proposal. The command get-app-package-info runs the
> populateDAG method of an application to construct the DAG but does not
> actually launch the DAG. An application developer does not know in which
> context the populateDAG is being called. For example, if they are recording
> application starts in an external system from populateDAG, they will have
> false entries there. This can be solved in different ways such as
> introducing another method in StreamingApplication or more parameters
> to populateDAG, but a non-disruptive option would be to add a property in
> the configuration object that is passed to populateDAG to indicate if it is
> simulate/test mode or real launch. An application developer can use this
> property to take the appropriate actions.
>
> Thanks
>
>


Re: [ANNOUNCE] New Apache Apex Committer: Ananth Gundabattula

2017-11-03 Thread Sanjay Pujare
Congratulations, Ananth

Sanjay


On Fri, Nov 3, 2017 at 1:50 AM, Thomas Weise  wrote:

> The Project Management Committee (PMC) for Apache Apex is pleased to
> announce Ananth Gundabattula as new committer.
>
> Ananth has been contributing to the project for about a year. Highlights:
>
> * Cassandra and Kudu operators with in-depth analysis/design work
> * Good collaboration, adherence to contributor guidelines and ownership of
> work
> * Work beyond feature focus such as fixing pre-existing test issues that
> impact CI
> * Presented at YOW Data and Dataworks Summit Australia
> * Enthusiast, contributes on his own time
>
> Welcome, Ananth, and congratulations!
> Thomas, for the Apache Apex PMC.
>


Re: [ANNOUNCE] New Apache Apex PMC: Tushar Gosavi

2017-11-03 Thread Sanjay Pujare
Congratulations, Tushar!

Sanjay


On Fri, Nov 3, 2017 at 9:17 AM, Pramod Immaneni 
wrote:

> A bit delayed but nevertheless important announcement, Apache Apex PMC is
> pleased to announce Tushar Gosavi as a new PMC member.
>
> Tushar has been contributing to Apex from the beginning of the project and
> has been working on the codebase for over 3 years. He is among the few who
> have a wide breadth of contribution, including both core and malhar, from
> internal changes to user-facing API, from input/output operators to
> components that support operators, and has a good overall understanding of
> the codebase and how it works.
>
> His salient contributions over the years are
>
>- Module support in Apex
>- Operator additions and improvements such as S3, File input and output,
>partitionable unique count and dynamic partitioning improvements
>- Initial WAL implementation from which subsequent implementations were
>derived for different use cases
>- Plugin support for Apex
>- Various bug fixes and improvements in both malhar and core that you
>can find in the JIRA
>- Participated in long-term project maintenance tasks such as
>refactoring operators and demos
>- Participated in important feature discussions
>- Reviewed and committed pull requests from contributors
>- Participated in conducting and teaching in an Apex workshop at a
>university and speaking at Apex conference organized by DataTorrent
>
> Conference talks & Presentations
>
>1. Presentations at VIIT and PICT Pune
>2. http://www.apexbigdata.com/pune-platform-talk-9.html
>3. http://www.apexbigdata.com/pune-integration-talk-4.html
>4. Webinar on Smart Partitioning with Apex. (
>https://www.youtube.com/watch?v=HCATB1zlLE4
>)
>5. Presented about customer use case at Pune Meetup in 2016
>
> Pramod for the Apache Apex PMC.
>


Re: checking dependencies for known vulnerabilities

2017-11-02 Thread Sanjay Pujare
> > >>>>>>>>>> an action on resolving the issue? The only possible
> > >>>>>>>>>> explanation that I see is the one that I already mentioned on
> > >>>>>>>>>> this thread.
> > >>>>>>>>>>
> > >>>>>>>>>> If you see this as unnecessary obstacles for legitimate
> > >>>>>>>>>> contributions, why enforce code style? It is also an
> > >>>>>>>>>> unnecessary obstacle. Unit tests? Should it be considered
> > >>>>>>>>>> optional for a PR to pass unit tests as well? What if an
> > >>>>>>>>>> environment change on the CI side causes the build to fail,
> > >>>>>>>>>> similar to what happened recently? Should we disable CI builds
> > >>>>>>>>>> too and rely on a committer or a release manager to run unit
> > >>>>>>>>>> tests? If a CI build fails for whatever reason, how can you be
> > >>>>>>>>>> sure that if it fails for another PR as well, they both fail
> > >>>>>>>>>> for the same reason and there are no other reasons that will
> > >>>>>>>>>> cause a problem with a PR?
> > >>>>>>>>>
> > >>>>>>>>> I don't know how failing PRs because of CVEs, which we don't
> > >>>>>>>>> introduce, don't control, have no idea of, and which are
> > >>>>>>>>> possibly unrelated, would fall in the same bucket as unit
> > >>>>>>>>> tests. I am at a loss for words. There is no reason to block
> > >>>>>>>>> legitimate development for this. This can be handled separately
> > >>>>>>>>> and in parallel. Maybe there is a way we can set up an
> > >>>>>>>>> independent job on a build server that runs nightly, fails if
> > >>>>>>>>> there are new CVEs discovered and sends an email out to the
> > >>>>>>>>> security or dev group. You could even reduce the CVE threshold
> > >>>>>>>>> for this. I don't believe in a stick approach (carrot and stick
> > >>>>>>>>> metaphor) and believe in proportional measures.
> > >>>>>>>>>
> > >>>>>>>>> Thank you,
> > >>>>>>>>>
> > >>>>>>>>> Vlad
> > >>>>>>>>>>
> > >>>>>>>>>>> On 10/26/17 09:42, Pramod Immaneni wrote:

Re: checking dependencies for known vulnerabilities

2017-09-12 Thread Sanjay Pujare
For a vendor too, quality ought to be as important as security, so I don't
think we disagree on the cost-benefit analysis. But I get your drift.

By "creative incentive" I didn't imply any material incentive (although a
gift card would be nice :-)) but more along the lines of what a community
can do to recognize such contribution.

Sanjay

On Tue, Sep 12, 2017 at 8:10 AM, Vlad Rozov <vro...@apache.org> wrote:

> I guess we have a different view on the benefit and cost definition. For
> me the benefit of fixing a CI build, a flaky unit test, or a severe
> security issue is huge for the community and is possibly small (except for
> security issues) for a vendor.
>
> By "creative" I hope you don't mean that other community members, users
> and customers send a contributor gift cards to compensate for the cost
> :). For me, a PR that is blocked on a failed CI build is sufficient
> incentive for a contributor to look into why it fails and fix it.
>
> Thank you,
>
> Vlad
>
> On 9/11/17 23:58, Sanjay Pujare wrote:
>
>> I don't want to speak for others and I don't want to generalize. But an
>> obvious answer could be "cost-benefit analysis".
>>
>> In any case we should come up with a creative way to "incentivize" members
>> to do these tasks.
>>
>


Re: checking dependencies for known vulnerabilities

2017-09-12 Thread Sanjay Pujare
I don't want to speak for others and I don't want to generalize. But an
obvious answer could be "cost-benefit analysis".

In any case we should come up with a creative way to "incentivize" members
to do these tasks.

On Mon, Sep 11, 2017 at 10:59 PM, Vlad Rozov <vro...@apache.org> wrote:

> So, you are saying that those members are eager to see new features, new
> functionalities and new code added to the project? Why are they not eager
> to see a unit test being fixed or a dependency with a severe security risk
> being removed? It is not that their original PR would be closed as a result
> of a unit test fix. What prevents those community members from putting time
> and effort into fixing the CI build (unit test, dependency), which will
> directly benefit the community but may not immediately benefit a vendor and
> its (paying) customers?
>
> Thank you,
>
> Vlad
>
> On 9/11/17 22:08, Sanjay Pujare wrote:
>
>> Comments inline:
>>
>>
>> On 9/10/17 23:40, Priyanka Gugale wrote:
>>>
>>> It's good idea to check for vulnerabilities, but as Pramod said all
>>>> softwares / libraries are going to have some or other vulnerability at
>>>> any
>>>> time. I will go with approach of "let's discuss this addition" and we
>>>> should not affect PRs which are not adding any new dependencies (due to
>>>> old
>>>> vulnerabilities).
>>>>
>>>> While all software/libraries are subject to insecure code and
>>> vulnerabilities, all software vendors whether open or close source
>>> hopefully try to make code more secure rather than insecure. If there is
>>> an
>>> existing or newly introduced dependency with a critical security issue, I
>>> don't see why Apex community wants to accept the high probability of
>>> being
>>> exposed to a security exploit. The only reasonable explanation for me is
>>> that the community members do not care about overall project quality and
>>> care only for tasks/PRs assigned to them by somebody else. I'll be glad
>>> to
>>> hear a different explanation for the proposal not to penalize PRs that do
>>> not introduce new dependencies and are affected by a newly found
>>> vulnerability in an existing dependency. Will not we all be penalized
>>> later
>>> if we don't fix it?
>>>
>>>
>> I take exception to the insinuation that (some) community members "care
>> only for tasks/PRs assigned to them by somebody else". It is quite
>> possible
>> or likely that these members are eager to see new features, new
>> functionalities, or new code added to the project because they get excited
>> by such things. You need to take into account the mindset of people who
>> are
>> submitting PRs to add a new functionality or fix a bug. The PR author's
>> focus correctly is on addressing that particular JIRA and ensuring that
>> JIRA gets resolved at the highest quality. To burden that PR author with
>> unrelated considerations of build systems, vulnerability findings and such
>> is not fair. Note that the project is (or should be) primarily driven by
>> users (and customers in case of vendors shipping this code in products)
>> who
>> use these features and pay for these features. So we need to balance the
>> long term concerns about "security issues" and quality with the immediate
>> term concerns about adding features and functionalities.
>>
>>
>>
>> Also I also strongly feel, we need to be meticulous and think it through
>>>> before introducing such checks for reasons discussed before.
>>>>
>>>> +1. Equally applies to a newly introduced functionality and bug fixes.
>>>
>>> Totally agree. However when we discuss or "think through" any concerns
>> they
>> should apply to the issue at hand (i.e. the newly introduced functionality
>> and bug fixes) and not external factors.
>>
>>
>> -Priyanka
>>>>
>>>>
>>>>
>


Re: checking dependencies for known vulnerabilities

2017-09-11 Thread Sanjay Pujare
Comments inline:


> On 9/10/17 23:40, Priyanka Gugale wrote:
>
>> It's a good idea to check for vulnerabilities, but as Pramod said all
>> software / libraries are going to have some vulnerability or other at any
>> time. I will go with the approach of "let's discuss this addition" and we
>> should not affect PRs which are not adding any new dependencies (due to
>> old vulnerabilities).
>>
> While all software/libraries are subject to insecure code and
> vulnerabilities, all software vendors, whether open or closed source,
> hopefully try to make code more secure rather than insecure. If there is an
> existing or newly introduced dependency with a critical security issue, I
> don't see why the Apex community wants to accept the high probability of
> being exposed to a security exploit. The only reasonable explanation for me
> is that the community members do not care about overall project quality and
> care only for tasks/PRs assigned to them by somebody else. I'll be glad to
> hear a different explanation for the proposal not to penalize PRs that do
> not introduce new dependencies and are affected by a newly found
> vulnerability in an existing dependency. Won't we all be penalized later
> if we don't fix it?
>


I take exception to the insinuation that (some) community members "care
only for tasks/PRs assigned to them by somebody else". It is quite possible
or likely that these members are eager to see new features, new
functionalities, or new code added to the project because they get excited
by such things. You need to take into account the mindset of people who are
submitting PRs to add a new functionality or fix a bug. The PR author's
focus correctly is on addressing that particular JIRA and ensuring that
JIRA gets resolved at the highest quality. To burden that PR author with
unrelated considerations of build systems, vulnerability findings and such
is not fair. Note that the project is (or should be) primarily driven by
users (and customers in case of vendors shipping this code in products) who
use these features and pay for these features. So we need to balance the
long term concerns about "security issues" and quality with the immediate
term concerns about adding features and functionalities.



>
>> I also strongly feel we need to be meticulous and think it through
>> before introducing such checks, for reasons discussed before.
>>
> +1. Equally applies to a newly introduced functionality and bug fixes.
>

Totally agree. However when we discuss or "think through" any concerns they
should apply to the issue at hand (i.e. the newly introduced functionality
and bug fixes) and not external factors.


>
>> -Priyanka
>>
>>
>>>


Re: following committer guideline when merging PR

2017-09-08 Thread Sanjay Pujare
Maybe you can help clear up the confusion for me. From the contribution
guidelines you cited, it looks like it is already the contributor's
responsibility to ensure all JIRA fields are correct and update them when
the PR is merged.

But in your original email you cited apex-malhar PR 669 as an example where
the committer overlooked this responsibility. What am I missing?

On Fri, Sep 8, 2017 at 1:01 PM, Vlad Rozov <vro...@apache.org> wrote:

> Sanjay,
>
> Please feel free to propose changes to the existing Apex contribution
> guidelines, but prior to doing that I'd strongly recommend that all
> community members at least review what already exists and follow the
> guidelines [1]. Additionally, it will be good for the community members to
> understand how the Apache Apex contribution and release processes work, and
> not simply engage in voting without engaging in discussion and following
> the existing guidelines. All this is an indication that, except for a few
> individuals, the community is not mature and independent.
>
> Thank you,
>
> Vlad
>
> [1] http://apex.apache.org/contributing.html
>
> "Before starting work, have a JIRA assigned to yourself. If you want to
> work on a ticket that is assigned to someone else, send a courtesy e-mail
> to the assignee to check if you can take it over. Confirm type, priority
> and other JIRA fields (often default values are not the best fit)."
>
> On 9/8/17 11:37, Sanjay Pujare wrote:
>
>> Regarding updating JIRA (item#3) I think it should ideally be the
>> contributor's responsibility to update the JIRA with all the required
>> fields and not the committer's.
>>
>> On Fri, Sep 8, 2017 at 11:25 AM, Vlad Rozov <v.rozo...@gmail.com> wrote:
>>
>> item #3. The concern with #661 is with all items that I marked in red in
>>> my first email.
>>>
>>> Thank you,
>>>
>>> Vlad
>>>
>>> On 9/8/17 10:48, Pramod Immaneni wrote:
>>>
>>> What's your concern with #669. It's a fix for a build issue (which you
>>>> created) and was approved by two committers. Wasn't getting builds to
>>>> successful state asap one of your top concerns based on your comments
>>>> and
>>>> -1 on #569 on core.
>>>>
>>>> On Fri, Sep 8, 2017 at 9:16 AM, Vlad Rozov <v.rozo...@gmail.com> wrote:
>>>>
>>>> Committers,
>>>>
>>>>> Please make sure to follow Apex community guideline when merging PR
>>>>> http://apex.apache.org/contributing.html.
>>>>>
>>>>> 1. Ensure that basic requirements for a pull request are met. This
>>>>>    includes:
>>>>>    * Sufficient time has passed for others to review
>>>>>    * PR was sufficiently reviewed and comments were addressed. See
>>>>>      voting policy <https://www.apache.org/foundation/voting.html>.
>>>>>    * When there are multiple reviewers, wait till other reviewers
>>>>>      approve, with a timeout of 48 hours before merging
>>>>>    * /If the PR was open for a long time, email dev@ declaring intent
>>>>>      to merge/
>>>>>    * Commit messages and PR title need to reference JIRA (pull
>>>>>      requests will be linked to ticket)
>>>>>    * /Travis CI and Jenkins pull request build needs to pass/
>>>>>    * /Ensure tests are added/modified for new features or fixes/
>>>>>    * Ensure appropriate JavaDoc comments have been added
>>>>>    * Verify contributions don't depend on incompatible licences
>>>>>      (see https://www.apache.org/legal/resolved.html#category-x)
>>>>> 2. Use the GitHub /rebase and merge/ option or the git command line to
>>>>>    merge the pull request (see the |view command line options| link on
>>>>>    the PR).
>>>>> 3. /Update JIRA after pushing the changes. Set the |Fix version| field
>>>>>    and resolve the JIRA with the proper resolution. *Also verify that
>>>>>    other fields (type, priority, assignee) are correct*./
>>>>>
>>>>>
>>>>> A couple of recent PR merges (#661, #669) to apex-malhar require a
>>>>> second look from the committers.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Vlad
>>>>>
>>>>>
>>>>>
>


Re: following committer guideline when merging PR

2017-09-08 Thread Sanjay Pujare
Regarding updating JIRA (item#3) I think it should ideally be the
contributor's responsibility to update the JIRA with all the required
fields and not the committer's.

On Fri, Sep 8, 2017 at 11:25 AM, Vlad Rozov  wrote:

> item #3. The concern with #661 is with all items that I marked in red in
> my first email.
>
> Thank you,
>
> Vlad
>
> On 9/8/17 10:48, Pramod Immaneni wrote:
>
>> What's your concern with #669? It's a fix for a build issue (which you
>> created) and was approved by two committers. Wasn't getting builds to a
>> successful state ASAP one of your top concerns, based on your comments and
>> -1 on #569 on core?
>>
>> On Fri, Sep 8, 2017 at 9:16 AM, Vlad Rozov  wrote:
>>
>> Committers,
>>>
>>> Please make sure to follow Apex community guideline when merging PR
>>> http://apex.apache.org/contributing.html.
>>>
>>> 1. Ensure that basic requirements for a pull request are met. This
>>>    includes:
>>>    * Sufficient time has passed for others to review
>>>    * PR was sufficiently reviewed and comments were addressed. See
>>>      voting policy <https://www.apache.org/foundation/voting.html>.
>>>    * When there are multiple reviewers, wait till other reviewers
>>>      approve, with a timeout of 48 hours before merging
>>>    * /If the PR was open for a long time, email dev@ declaring intent
>>>      to merge/
>>>    * Commit messages and PR title need to reference JIRA (pull
>>>      requests will be linked to ticket)
>>>    * /Travis CI and Jenkins pull request build needs to pass/
>>>    * /Ensure tests are added/modified for new features or fixes/
>>>    * Ensure appropriate JavaDoc comments have been added
>>>    * Verify contributions don't depend on incompatible licences
>>>      (see https://www.apache.org/legal/resolved.html#category-x)
>>> 2. Use the GitHub /rebase and merge/ option or the git command line to
>>>    merge the pull request (see the |view command line options| link on
>>>    the PR).
>>> 3. /Update JIRA after pushing the changes. Set the |Fix version| field
>>>    and resolve the JIRA with the proper resolution. *Also verify that
>>>    other fields (type, priority, assignee) are correct*./
>>>
>>>
>>> A couple of recent PR merges (#661, #669) to apex-malhar require a second
>>> look from the committers.
>>>
>>> Thank you,
>>>
>>> Vlad
>>>
>>>
>


Re: [jira] [Commented] (APEXCORE-780) Travis-CI Build failures to be fixed

2017-08-30 Thread Sanjay Pujare
No, go ahead.

On Aug 30, 2017 7:14 AM, "Vlad Rozov (JIRA)"  wrote:

>
> [ https://issues.apache.org/jira/browse/APEXCORE-780?page=
> com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel=16147287#comment-16147287 ]
>
> Vlad Rozov commented on APEXCORE-780:
> -
>
> [~sanjaypujare] do you mind if I take over the JIRA?
>
> > Travis-CI Build failures to be fixed
> > 
> >
> > Key: APEXCORE-780
> > URL: https://issues.apache.org/jira/browse/APEXCORE-780
> > Project: Apache Apex Core
> >  Issue Type: Bug
> >Reporter: Sanjay M Pujare
> >Assignee: Sanjay M Pujare
> >Priority: Minor
> >
> > Automated builds triggered by PRs are failing on Travis-CI
> intermittently for unknown reasons. Needs to be fixed to make CI useful.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
>


Re: Input needed for ApexCli.

2017-08-05 Thread Sanjay Pujare
+1 for best effort. I don't think a flag to offer alternative behavior is
of much value.

On Aug 5, 2017 11:31 AM, "AJAY GUPTA"  wrote:

> It could be useful to have a flag and let the user decide the best approach
> for him. We can have the default behaviour as best-effort, with support for
> validate-and-fail via a flag.
>
> Ajay
>
> On Sat, 5 Aug 2017 at 8:33 AM, Bhupesh Chawda 
> wrote:
>
> > +1 for best effort with warnings.
> >
> > ~ Bhupesh
> >
> > On Aug 4, 2017 23:46, "Pramod Immaneni"  wrote:
> >
> > > I would prefer "Best effort" with warnings for the ones that are
> invalid.
> > >
> > > On Fri, Aug 4, 2017 at 9:42 AM, Florian Schmidt <
> flor...@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Hey everyone,
> > > >
> > > > I am currently extending the ApexCli so that the `shutdown-app`
> > > > command supports both the appId and the appName as an argument (see
> > > > https://issues.apache.org/jira/browse/APEXCORE-767 <
> > > > https://issues.apache.org/jira/browse/APEXCORE-767>)
> > > >
> > > > During the review of the pull request, the following discussion came
> > up:
> > > >
> > > > When a user passes multiple appNames / appIds to the shutdown command
> > > > (e.g. shutdown-app appA appB appC) and e.g. appB does not exists,
> which
> > > one
> > > > of the two approaches do we want to go:
> > > >
> > > > "Best effort”: Try to shutdown all those apps where we can find an
> app
> > to
> > > > the provided appName or appId. Print a warning if an app cannot
> found.
> > > >
> > > > “Validate and Fail”: Validate that all apps can be found by the
> > provided
> > > > appId / appName. Do not run the command if one of the apps can’t be
> > found
> > > >
> > > > This decision would probably influence the behavior of other CLI
> > commands
> > > > in the future as well, so that they all behave in a consistent way.
> > What
> > > > are your opinions?
> > > >
> > > > Regards
> > > >
> > > > Florian
> > > >
> > > >
> > > >
> > > >
> > >
> >
>


Re: Difference between setup() and activate()

2017-07-29 Thread Sanjay Pujare
The Javadoc comment
for com.datatorrent.api.Operator.ActivationListener  (in
https://github.com/apache/apex-core/blob/master/api/src/main/java/com/datatorrent/api/Operator.java)
should hopefully answer your questions.

Specifically:

1. No, setup() is called only once in the entire lifetime (
http://apex.apache.org/docs/apex/operator_development/#setup-call)

2. Yes. When an operator is "activated" - the first time in its life or on
reactivation after a failover - activate() is called before the first
beginWindow() is called.

3. Yes.
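
To illustrate, here is a minimal sketch (my own, not from the Javadoc) of an
operator that rebuilds its transient state in activate():

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.Operator;
import com.datatorrent.common.util.BaseOperator;

public class MyOperator extends BaseOperator
    implements Operator.ActivationListener<OperatorContext>
{
  // Checkpointed state survives serialize/deserialize cycles.
  private long processedCount;

  // Transient state is lost on every deserialization.
  private transient StringBuilder scratch;

  @Override
  public void activate(OperatorContext context)
  {
    // Runs before the first beginWindow(), both on first activation and on
    // reactivation after a failover, so transient fields are always fresh.
    scratch = new StringBuilder();
  }

  @Override
  public void deactivate()
  {
    scratch = null;
  }
}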


On Sun, Jul 30, 2017 at 12:18 AM, Ananth G  wrote:

> Hello All,
>
> I was looking at the documentation and could not get a clear distinction
> of behaviours for setup() and activate() during scenarios when an operator
> is passivated ( ex: application shutdown, repartition use cases ) and being
> brought back to life again. Could someone from the community advise me on
> the following questions ?
>
> 1. Is setup() called in these scenarios (serialize/deserialize cycles) as
> well ?
>
> 2. I am assuming activate() is called in these scenarios ? - The javadoc
> for activation states that the activate() can be called multiple times (
> without explicitly stating why ) and my assumption is that it is because of
> these scenarios.
>
> 3. If setup() is only called once during the lifetime of an operator , is
> it fair to assume that activate() is the best place to resolve all of the
> transient fields of an operator ?
>
>
> Regards,
> Ananth


Re: Malhar CI builds fail

2017-07-16 Thread Sanjay Pujare
Root cause analysis... Let me take a look...

On Jul 16, 2017 9:44 AM, "Pramod Immaneni"  wrote:

> RCA?
>
> On Sat, Jul 15, 2017 at 10:21 AM, Vlad Rozov 
> wrote:
>
> > Anyone to RCA why Malhar CI builds fail?
> >
> > Thank you,
> >
> > Vlad
> >
>


Re: Regarding including zip file as part of source code

2017-07-07 Thread Sanjay Pujare
>Other python dependencies required by user python code as now needs to be
>installed on Hadoop node.

So you are assuming Python itself is installed on each Hadoop node and
other Python packages needed by the user's Python code are also installed?
Can your py4j package be installed through "pip" - Python's native package
manager? If that's possible, why does it need to be added to the Apex repo?
The user will anyway need to ensure all Python dependencies are satisfied
and your py4j package is just one of them. We can just document the
dependency, can't we?





On Fri, Jul 7, 2017 at 8:46 AM, vikram patil  wrote:

> Thanks Vlad, Thomas .
>
> Py4j acts as a bridge between Java and Python, so the py4j package is
> needed (like a jar) for the code to work properly. If it is installed on
> the Hadoop node then it won't be needed, but since this code is used as
> part of the basic framework to establish the bridge between the Java and
> Python frameworks, it needs to be transported at application launch.
>
> As of now, other Python dependencies required by the user's Python code
> need to be installed on the Hadoop node.
>
> Thanks & Regards,
> Vikram
>
>
>
>


Re: Maintenance of Bigtop Apex component

2017-06-05 Thread Sanjay Pujare
Thomas,

I am interested but I want an idea of the time commitment needed. We can
communicate offline if you want.

Sanjay



On Sun, Jun 4, 2017 at 10:22 PM, Chinmay Kolhatkar 
wrote:

> Hi Thomas,
>
> Even I'm currently occupied and won't be able to work on this as of now.
> Also, you're already collaborator on docker-pool repository.
>
> -Chinmay.
>
>
> On Sat, Jun 3, 2017 at 3:08 PM, Aniruddha Thombare <
> anirud...@datatorrent.com> wrote:
>
> > Hi,
> >
> > I'm currently preoccupied, won't be able to work on this as of now.
> >
> > Thanks,
> >
> > A
> >
> >
> > _
> > Sent with difficulty, I mean handheld ;)
> >
> > On 1 Jun 2017 10:23 am, "Thomas Weise"  wrote:
> >
> > > Hi,
> > >
> > > I recently tried to figure out what is involved in maintaining the Apex
> > > component of Bigtop and related artifacts. The main recurring task is
> to
> > up
> > > the version, but of course help may be needed for any other reason.
> > >
> > > Unfortunately I could not find any documentation for the version
> > upgrade. I
> > > added the following page to gather information for the future:
> > >
> > > https://cwiki.apache.org/confluence/display/APEX/ApexBigtop
> > >
> > > Currently the following maintainers are on record for Bigtop:
> > >
> > > chinmay , aniruddha 
> > >
> > > Are Chinmay and Aniruddha still available for this? Does someone else
> > need
> > > to step up?
> > >
> > > I'm looking for the instructions to update the Docker image:
> > >
> > > https://issues.apache.org/jira/browse/APEXCORE-730
> > >
> > > Thanks,
> > > Thomas
> > >
> >
>


Re: File control tuples

2017-06-03 Thread Sanjay Pujare
The way this is supposed to work is: if the downstream operator wants both
data and metadata, it connects to this new port ("...which would carry the
metadata along with the actual tuple..."); otherwise it connects to the
legacy port, which continues to behave the same way. Note that each tuple on
the new port has metadata as well.

+1 for the proposal.
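
To sketch what I mean (illustrative only, not the actual Malhar operator;
the wrapper type and port names are made up):

import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class FileSourceSketch extends BaseOperator
{
  // Hypothetical wrapper pairing a tuple with the file it came from.
  public static class TupleWithMeta
  {
    public final String line;
    public final String fileName;

    public TupleWithMeta(String line, String fileName)
    {
      this.line = line;
      this.fileName = fileName;
    }
  }

  // Legacy port: plain tuples, behavior unchanged.
  public final transient DefaultOutputPort<String> output =
      new DefaultOutputPort<>();

  // New port: tuple plus metadata, for downstream operators that want both.
  public final transient DefaultOutputPort<TupleWithMeta> outputWithMeta =
      new DefaultOutputPort<>();

  private void emitLine(String line, String currentFile)
  {
    output.emit(line);
    // Pay the metadata cost only when a downstream operator asked for it.
    if (outputWithMeta.isConnected()) {
      outputWithMeta.emit(new TupleWithMeta(line, currentFile));
    }
  }
}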

On Sat, Jun 3, 2017 at 9:37 AM, Thomas Weise  wrote:

> How does this relate to the batch control tuples work?
>
> With a separate port, how can a downstream operator relate the metadata to
> the tuples emitted from the primary port?
>
> --
> sent from mobile
> On Jun 2, 2017 12:06 PM, "Bhupesh Chawda"  wrote:
>
> Hi,
>
> Emitting file information for a file-based source like a file input
> operator in Malhar seems like a good feature to provide. It is useful
> information for any downstream operator to know that a data tuple belongs
> to a certain file, for instance.
>
>
> We propose to add capability in the abstract file input operator to emit
> file control tuples. These control tuples can include filenames as well as
> any metadata that the user wishes to include along with it.
>
> To link this metadata to each tuple, we can add another port to the input
> operator which would carry the metadata along with the actual tuple. We
> can try to reduce the amount of metadata that goes with each tuple by
> having some sort of meta encoding in the control tuple.
>
>
> ~ Bhupesh
>
>
> ___
>
> Bhupesh Chawda
>
> E: bhup...@datatorrent.com | Twitter: @bhupeshsc
>
> www.datatorrent.com  |  apex.apache.org
>


Re: [VOTE] impersonation and application path

2017-05-23 Thread Sanjay Pujare
+1 for Option 1, Sub-option b.

On Tue, May 23, 2017 at 10:34 AM, Pramod Immaneni 
wrote:

> Based on the discussion on the dev list on this topic, captured here
>  4431016024f61703b248ce7bfb@%3Cdev.apex.apache.org%3E>,
> a few different approaches were discussed including the pros and cons. I
> would like to put the suggested approaches up for a vote. Please go through
> the thread above if you are unfamiliar with the topic. Here are the
> options.
>
> 1. Let the default remain as the impersonating user
>a. The user specifies to use impersonated user's directory by
> setting the APPLICATION_PATH attribute.
>b. The user specifies the use of impersonated user's resources using
> a new setting that will internally set the APPLICATION_PATH to the
> corresponding value and any other resources in the future.
>
> 2. Change the default from current behavior to use the impersonated user's
> resources instead
>a. and b. similar to 1. but specifying impersonating user's
> resources instead.
>
> My vote is for Option 1, Sub-option b).
>
> Thanks
>


Re: impersonation and application path

2017-05-19 Thread Sanjay Pujare
I agree. But how do we use APPLICATION_PATH for this purpose since we need
a Yes/No flag to specify new vs old behavior?

So we have to use a new setting for this (something like
USE_RUNTIME_USER_APPLICATION_PATH ?)
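
For illustration, the explicit-override route is already possible today; a
minimal sketch (the path below is a made-up placeholder, and the attribute
can also be supplied via configuration files rather than in code):

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class ImpersonatedApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Explicit override: keep application artifacts and checkpoints under
    // the impersonated user B's home directory instead of user A's.
    dag.setAttribute(Context.DAGContext.APPLICATION_PATH,
        "/user/userB/datatorrent/apps");
    // ... add operators and streams here ...
  }
}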

On Fri, May 19, 2017 at 7:57 AM, Pramod Immaneni 
wrote:

> I wouldn't necessarily consider the current behavior a bug, and the default
> is fine the way it is today, especially because the user launching the app
> is not the user the application runs as. APPLICATION_PATH can be used as
> the setting.
>
> On Fri, May 19, 2017 at 7:43 AM, Vlad Rozov 
> wrote:
>
> > Do I understand correctly that the question is regarding the
> > DAGContext.APPLICATION_PATH attribute value in case it is not defined? In
> > this case, I would treat the current behavior as a bug and +1 the proposal
> > to set it to the impersonated user B's DFS home directory. As
> > APPLICATION_PATH can be explicitly set, I do not see a need to provide
> > another setting to preserve the current behavior.
> >
> > Thank you,
> >
> > Vlad
> >
> >
> > On 5/18/17 15:46, Pramod Immaneni wrote:
> >
> >> Sorry typo in sentence "as we are not asking for permissions for a lower
> >> privilege", please read as "as we are now asking for permissions for a
> >> lower privilege".
> >>
> >> On Thu, May 18, 2017 at 3:44 PM, Pramod Immaneni <
> pra...@datatorrent.com>
> >> wrote:
> >>
> >>> Apex cli supports impersonation in secure mode. With impersonation, the
> >>> user running the cli or the user authenticating with hadoop (henceforth
> >>> referred to as login user) can be different from the effective user
> >>> with which the actions are performed under hadoop. An example of this
> >>> is an application being launched by user A to run in hadoop as user B.
> >>> This is kind of like the sudo functionality in unix. You can find more
> >>> details about the functionality here
> >>> https://apex.apache.org/docs/apex/security/ in
> >>> the Impersonation section.
> >>>
> >>> What happens today with launching an application with impersonation,
> >>> using
> >>> the above launch example, is that even though the application runs as
> >>> user
> >>> B it still uses user A's hdfs path for the application path. The
> >>> application path is where the artifacts necessary to run the
> application
> >>> are stored and where the runtime files like checkpoints are stored.
> This
> >>> means that user B needs to have read and write access to user A's
> >>> application path folders.
> >>>
> >>> This may not be allowed in certain environments as it may be a policy
> >>> violation for the following reason. Because user A is able to
> impersonate
> >>> as user B to launch the application, A is considered to be a higher
> >>> privileged user than B and is given necessary privileges in hadoop to
> do
> >>> so. But after launch B needs to access folders belonging to A which
> could
> >>> constitute a violation as we are not asking for permissions for a lower
> >>> privilege user to access resources of a higher privilege user.
> >>>
> >>> I would like to propose adding a configuration setting, which when set
> >>> will use the application path in the impersonated user's home directory
> >>> (user B) as opposed to impersonating user's home directory (user A). If
> >>> this setting is not specified then the behavior can default to what it
> is
> >>> today for backwards compatibility.
> >>>
> >>> Comments, suggestions, concerns?
> >>>
> >>> Thanks
> >>>
> >>>
> >
>


Re: impersonation and application path

2017-05-18 Thread Sanjay Pujare
+1 for Pramod's proposal for impersonation.

I have an issue with Sandesh's suggestion of making the new behavior the
default (or only) behavior. This will introduce incompatibility with
other legacy tools (e.g. Datatorrent's dtGateway) that assume user A's HDFS
path as the application path. Because the legacy tools will continue to
assume the old path (user A's path) they will not work with the Apex core
that has this change.

The current behavior might also be preferable to certain users or their
administrators because of not having to deal with multiple HDFS user
directories (for administration, logging, backup etc).

On Thu, May 18, 2017 at 4:01 PM, Sandesh Hegde 
wrote:

> My vote is to make the new proposal the default behavior. Is there a use
> case for the current behavior? If not, then there is no need to add the
> configuration setting.
>
> On Thu, May 18, 2017 at 3:47 PM Pramod Immaneni 
> wrote:
>
> > Sorry typo in sentence "as we are not asking for permissions for a lower
> > privilege", please read as "as we are now asking for permissions for a
> > lower privilege".
> >
> > On Thu, May 18, 2017 at 3:44 PM, Pramod Immaneni  >
> > wrote:
> >
> > > Apex cli supports impersonation in secure mode. With impersonation, the
> > > user running the cli or the user authenticating with hadoop (henceforth
> > > referred to as login user) can be different from the effective user
> > > with which the actions are performed under hadoop. An example of this
> > > is an application being launched by user A to run in hadoop as user B.
> > > This is kind of like the sudo functionality in unix. You can find more
> > > details about the functionality here
> > > https://apex.apache.org/docs/apex/security/ in
> > > the Impersonation section.
> > >
> > > What happens today with launching an application with impersonation,
> > using
> > > the above launch example, is that even though the application runs as
> > user
> > > B it still uses user A's hdfs path for the application path. The
> > > application path is where the artifacts necessary to run the
> application
> > > are stored and where the runtime files like checkpoints are stored.
> This
> > > means that user B needs to have read and write access to user A's
> > > application path folders.
> > >
> > > This may not be allowed in certain environments as it may be a policy
> > > violation for the following reason. Because user A is able to
> impersonate
> > > as user B to launch the application, A is considered to be a higher
> > > privileged user than B and is given necessary privileges in hadoop to
> do
> > > so. But after launch B needs to access folders belonging to A which
> could
> > > constitute a violation as we are not asking for permissions for a lower
> > > privilege user to access resources of a higher privilege user.
> > >
> > > I would like to propose adding a configuration setting, which when set
> > > will use the application path in the impersonated user's home directory
> > > (user B) as opposed to impersonating user's home directory (user A). If
> > > this setting is not specified then the behavior can default to what it
> is
> > > today for backwards compatibility.
> > >
> > > Comments, suggestions, concerns?
> > >
> > > Thanks
> > >
> >
>


Re: Backward compatibility issue in 3.6.0 release

2017-05-15 Thread Sanjay Pujare
I vote for renaming to less common names like __count. The renaming breaks
compatibility from 3.6.0 to 3.7.0 but seems to be the best option.

On Mon, May 15, 2017 at 9:53 AM, Vlad Rozov  wrote:

> Hi All,
>
> There is a possible change in operators behavior caused by changes that
> were introduced in the release 3.6.0 into DefaultInputPort and
> DefaultOutputPort. Please see https://issues.apache.org/jira
> /browse/APEXCORE-722. We need to agree how to proceed.
>
> 1. Break semantic versioning for the Default Input and Output Ports in the
> next release (3.7.0), declare protected variables as private and provide
> protected access method. Another option is to rename protected variables to
> use less common names (for example __count).
> 2. Keep protected variables with the risk that the following common
> operator design pattern will be used accidentally by existing operators and
> newly designed operators:
>
> public class Operator extends BaseOperator {
>   private int count;
>   public DefaultInputPort<Object> in = new DefaultInputPort<Object>() {
>     @Override
>     public void process(Object tuple)
>     {
>       count++;  // updates DefaultInputPort count, not Operator count!
>     }
>   };
> }
>
>
> Thank you,
>
> Vlad
>
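
For concreteness, option 1 could look roughly like the following simplified
sketch (the real DefaultInputPort has more members; this only illustrates the
visibility change):

public abstract class DefaultInputPort<T>
{
  private int count;               // private instead of protected

  protected final int getCount()   // controlled read access for subclasses
  {
    return count;
  }

  public void put(T tuple)
  {
    count++;
    process(tuple);
  }

  public abstract void process(T tuple);
}

With the field private, a subclass's "count++" as in the example above no
longer compiles against the port's counter by accident.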


Enhancement to support custom SSL configuration

2017-04-21 Thread Sanjay Pujare
Currently Stram supports only the default Hadoop SSL configuration because
it uses the org.apache.hadoop.yarn.webapp.WebApps helper class, which has the
limitation of only using the default Hadoop SSL config that is read from
Hadoop's ssl-server.xml resource file. Some users have run into a situation
where Hadoop's SSL keystore is not available on most cluster nodes, or the
Stram process doesn't have read access to the keystore even when present.
So there is a need for Stram to use a custom SSL keystore and
configuration that does not suffer from these limitations.

I am planning to fix this by first fixing WebApps in Hadoop and then
enhancing Stram to use this new fix in Hadoop. I have already submitted a
PR https://github.com/apache/hadoop/pull/213 to Hadoop and one of the
Hadoop distributors has agreed to accept this fix so I expect it to be
merged very soon.

After that I will enhance Stram to accept the location of a custom
ssl-server.xml file (supplied by the client via a DAG attribute or
property) and use the values from that file to set up the config object to
be passed to WebApps which will end up using the custom SSL configuration.
I have already verified this approach in a prototype.
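
As an illustration, the Stram-side merge could look roughly like this sketch
(assuming the custom ssl-server.xml location arrives via a DAG attribute; the
method and variable names here are illustrative, not existing Apex
identifiers):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CustomSslConfigHelper
{
  static Configuration withCustomSsl(Configuration base, String customSslServerXml)
  {
    Configuration sslConf = new Configuration(false);
    sslConf.addResource(new Path(customSslServerXml));  // client-supplied file

    Configuration merged = new Configuration(base);
    // standard Hadoop ssl-server.xml keys
    for (String key : new String[] {"ssl.server.keystore.location",
        "ssl.server.keystore.password", "ssl.server.keystore.keypassword",
        "ssl.server.keystore.type"}) {
      String value = sslConf.get(key);
      if (value != null) {
        merged.set(key, value);  // WebApps then sees the custom keystore
      }
    }
    return merged;
  }
}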

We will also enhance the Apex client/launcher to distribute the custom SSL
files (XML and the keystore) along with the application jars/resources so
the user does not need to pre-distribute the custom SSL files.

Please let me know your comments.

Sanjay


Re: Programmatic log4j appender in Apex

2017-04-09 Thread Sanjay Pujare
Please give some examples and/or use cases of this programmatic log4j
appender.

On Fri, Apr 7, 2017 at 8:40 PM, Sergey Golovko 
wrote:

> Hi All,
>
> I'd like to add supporting of a custom defined log4j appender that can be
> added to Apex Application Master and Containers and be configurable
> programmatically.
>
> Sometimes it is not trivial to control log4j configuration via log4j
> properties. And I think having a way to add a log4j appender
> programmatically will allow customers and developers to plug in their
> own custom-defined log4j appenders and be much more flexible for streaming
> and collection of Apex log events.
>
> I plan to provide a generic approach for defining the programmatic
> log4j appender and to pass all configuration parameters, including the name
> of the Java class implementing the log4j appender, via system and/or
> command line properties.
>
> Thanks,
> Sergey
>
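
For context, a programmatic log4j (1.x) appender along the lines Sergey
describes could be as small as this sketch; the class name and the forwarding
target are made up for illustration:

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

public class ForwardingAppender extends AppenderSkeleton
{
  @Override
  protected void append(LoggingEvent event)
  {
    // forward the rendered message to some external collector (stdout here)
    System.out.println("forwarded: " + event.getRenderedMessage());
  }

  @Override
  public void close()
  {
    // release any resources held by the collector connection
  }

  @Override
  public boolean requiresLayout()
  {
    return false;
  }

  public static void main(String[] args)
  {
    // registered in code rather than via log4j.properties
    Logger.getRootLogger().addAppender(new ForwardingAppender());
    Logger.getLogger(ForwardingAppender.class).info("hello");
  }
}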


Re: open/close ports and active/inactive streams

2017-04-01 Thread Sanjay Pujare
> "It should behave as if the upstream operator does not emit anything on
the port."

But as per your earlier description, the upstream operator "...for any reason
decides that it will not emit tuples on the output port, it may close
it...". So this is not just the case where the upstream operator is not
emitting tuples because there is no data; it has implicitly changed the
DAG by closing the stream. As an example, a port could be receiving events
marking "window" boundaries for aggregating data received from the other
port. It will be useful for the operator to know these events will not be
received anymore so it can stop aggregating data from the other port. In
any case, your idea of the "EOS" control tuple addresses this.





>> operator needs to behave differently because it's a different DAG now.
>> Will
>> a control tuple inform it that the input port is closing?
>>
> Again, the details will need to be flushed out. It may be an eos (end of
> stream) control tuple. An operator with 2 input ports may not even care
> that one of the input ports is connected to inactive stream. It should
> behave as if the upstream operator does not emit anything on the port.


Re: open/close ports and active/inactive streams

2017-04-01 Thread Sanjay Pujare
Sounds like a good idea, +1

A couple of questions/comments:
- "...re-open the port and wait till the stream becomes active again...".
How is the operator informed that the stream is active again, or is it just
a property of the output port that the operator needs to check?
- Say an operator has 2 input ports, one of which becomes inactive as per
this scenario. It cannot be undeployed because of the other port, but the
operator needs to behave differently because it's a different DAG now. Will
a control tuple inform it that the input port is closing?


On Sat, Apr 1, 2017 at 8:12 AM, Vlad Rozov  wrote:

> All,
>
> Currently Apex assumes that an operator can emit on any defined output
> port and all streams defined by a DAG are active. I'd like to propose an
> ability for an operator to open and close output ports. By default all
> ports defined by an operator will be open. In the case an operator for any
> reason decides that it will not emit tuples on the output port, it may
> close it. This will make the stream inactive and the application master may
> undeploy the downstream (for that input stream) operators. If this leads to
> containers that don't have any active operators, those containers may be
> undeployed as well leading to better cluster resource utilization and
> better Apex elasticity. Later, the operator may be in a state where it
> needs to emit tuples on the closed port. In this case, it needs to re-open
> the port and wait till the stream becomes active again before emitting
> tuples on that port. Making an inactive stream active again requires the
> application master to re-allocate containers and re-deploy the downstream
> operators.
>
> It should also be possible for an application designer to mark streams as
> inactive when an application starts. This will allow the application master
> to avoid reserving all containers when the application starts. Later, the
> port can be opened and the inactive stream becomes active.
>
> Thank you,
>
> Vlad
>
>
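
To make the proposal concrete, here is a purely hypothetical sketch of how an
operator might use such an API; no close()/open() methods exist on ports
today, so everything below is illustrative only:

public class FilteringOperator extends BaseOperator
{
  public final transient DefaultOutputPort<String> out = new DefaultOutputPort<String>();
  private boolean doneEmitting;

  @Override
  public void endWindow()
  {
    if (doneEmitting) {
      out.close();  // hypothetical: marks the downstream stream inactive
    }
  }
}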


Re: Maven build package error with 3.5

2017-02-27 Thread Sanjay Pujare
The root cause seems to be embedded in your output:

Could not transfer artifact org.apache.apex:malhar-library:pom:3.5.0
from/to central (
https://repo.maven.apache.org/maven2):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target -> [Help 1]

Could this be because a corporate firewall is stopping this access?  You
can check this out
http://stackoverflow.com/questions/25911623/problems-using-maven-and-ssl-behind-proxy
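
If a proxy is indeed the culprit, one common workaround is to declare it in
~/.m2/settings.xml; the host and port below are placeholders:

<settings>
  <proxies>
    <proxy>
      <id>corp-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>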


On Mon, Feb 27, 2017 at 3:15 PM, Dongming Liang 
wrote:

> It was running well with 3.4, but now failing with Apex 3.5
>
> ➜  log-aggregator git:(apex-tcp) ✗ mvn package -DskipTests
> [INFO] Scanning for projects...
> [INFO]
> [INFO]
> 
> [INFO] Building Aggregator 1.0-SNAPSHOT
> [INFO]
> 
> Downloading:
> https://repo.maven.apache.org/maven2/org/apache/apex/malhar-
> library/3.5.0/malhar-library-3.5.0.pom
> Downloading:
> https://repo.maven.apache.org/maven2/org/apache/apex/apex-
> api/3.5.0/apex-api-3.5.0.pom
> Downloading:
> https://repo.maven.apache.org/maven2/org/apache/apex/apex-
> common/3.5.0/apex-common-3.5.0.pom
> Downloading:
> https://repo.maven.apache.org/maven2/org/apache/apex/apex-
> engine/3.5.0/apex-engine-3.5.0.pom
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 10.435 s
> [INFO] Finished at: 2017-02-27T15:03:06-08:00
> [INFO] Final Memory: 11M/245M
> [INFO]
> 
> [ERROR] Failed to execute goal on project log-aggregator: Could not resolve
> dependencies for project
> com.capitalone.vault8:log-aggregator:jar:1.0-SNAPSHOT: Failed to collect
> dependencies at org.apache.apex:malhar-library:jar:3.5.0: Failed to read
> artifact descriptor for org.apache.apex:malhar-library:jar:3.5.0: Could
> not
> transfer artifact org.apache.apex:malhar-library:pom:3.5.0 from/to
> central (
> https://repo.maven.apache.org/maven2):
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/
> DependencyResolutionException
> ➜  log-aggregator git:(apex-tcp) ✗
>
> The pom file is:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>   <modelVersion>4.0.0</modelVersion>
>
>   <groupId>com.mycom.teamx</groupId>
>   <artifactId>log-aggregator</artifactId>
>   <packaging>jar</packaging>
>   <version>1.0-SNAPSHOT</version>
>
>   <name>Aggregator</name>
>   <description>Log Aggregator</description>
>
>   <properties>
>     <apex.version>3.5.0</apex.version>
>     <apex.apppackage.classpath>lib/*.jar</apex.apppackage.classpath>
>   </properties>
>
>   <build>
>     <plugins>
>       <plugin>
>         <groupId>org.apache.maven.plugins</groupId>
>         <artifactId>maven-eclipse-plugin</artifactId>
>         <version>2.9</version>
>         <configuration>
>           <downloadSources>true</downloadSources>
>         </configuration>
>       </plugin>
>       <plugin>
>         <artifactId>maven-compiler-plugin</artifactId>
>         <version>3.3</version>
>         <configuration>
>           <encoding>UTF-8</encoding>
>           <source>1.7</source>
>           <target>1.7</target>
>           <debug>true</debug>
>           <optimize>false</optimize>
>           <showDeprecation>true</showDeprecation>
>           <showWarnings>true</showWarnings>
>         </configuration>
>       </plugin>
>       <plugin>
>         <artifactId>maven-dependency-plugin</artifactId>
>         <version>2.8</version>
>         <executions>
>           <execution>
>             <id>copy-dependencies</id>
>             <phase>prepare-package</phase>
>             <goals>
>               <goal>copy-dependencies</goal>
>             </goals>
>             <configuration>
>               <outputDirectory>target/deps</outputDirectory>
>               <includeScope>runtime</includeScope>
>             </configuration>
>           </execution>
>         </executions>
>       </plugin>
>       <plugin>
>         <artifactId>maven-assembly-plugin</artifactId>
>         <executions>
>           <execution>
>             <id>app-package-assembly</id>
>             <phase>package</phase>
>             <goals>
>               <goal>single</goal>
>             </goals>
>             <configuration>
>               <finalName>${project.artifactId}-${project.version}-apexapp</finalName>
>               <appendAssemblyId>false</appendAssemblyId>
>               <descriptors>
>                 <descriptor>src/assemble/appPackage.xml</descriptor>
>               </descriptors>
>               <archiverConfig>
>                 <defaultDirectoryMode>0755</defaultDirectoryMode>
>               </archiverConfig>
>               <archive>
>                 <manifestEntries>
>                   <Class-Path>${apex.apppackage.classpath}</Class-Path>
>                   <DT-Engine-Version>${apex.version}</DT-Engine-Version>
>                   <DT-App-Package-Group-Id>${project.groupId}</DT-App-Package-Group-Id>
>                   <DT-App-Package-Name>${project.artifactId}</DT-App-Package-Name>
>                   <DT-App-Package-Version>${project.version}</DT-App-Package-Version>
>                   <DT-App-Package-Display-Name>${project.name}</DT-App-Package-Display-Name>
>                   <DT-App-Package-Description>${project.description}</DT-App-Package-Description>
>                 </manifestEntries>
>               </archive>
>             </configuration>
>           </execution>
>         </executions>
>       </plugin>
>       <plugin>
>         <artifactId>maven-antrun-plugin</artifactId>
>         <version>1.7</version>
>         <executions>
>           <execution>
>             <phase>package</phase>

Re: Java packages: legacy -> org.apache.apex

2017-02-27 Thread Sanjay Pujare
+1 for bullet 1 assuming new code implies brand new classes (since it
doesn't involve any backward compatibility issues). We can always review
contributor PRs to make sure new code is added with new package naming
guidelines.

But for 2 and 3 I have a question/comment: is there even a need to do it?
There is lots of open source code with package names like com.google.* and
com.sun.* etc and as far as I know there are no moves afoot to rename these
packages. The renaming itself doesn't add any new functionality or
technical capabilities but introduces instability in Apex code as well as
user code. Just a thought...

On Mon, Feb 27, 2017 at 8:23 AM, Chinmay Kolhatkar 
wrote:

> Thomas,
>
> I agree with you that we need this migration to be done but I have a
> different opinion on how to execute this.
> I think if we do this in phases as described above, users might end up in
> more confusion.
>
> For doing this migration, I think it should follow these steps:
> 1. Whether for operator library or core components, we should announce
> widely on dev and users mailing list that "...such change is going to
> happen in next release"
> 2. Take up the work all at once and not phase it.
>
> Thanks,
> Chinmay.
>
>
>
> On Mon, Feb 27, 2017 at 9:09 PM, Thomas Weise  wrote:
>
> > Hi,
> >
> > This topic has come up on several PRs and I think it warrants a broader
> > discussion.
> >
> > At the time of incubation, the decision was to defer change of Java
> > packages from com.datatorrent to org.apache.apex till next major release
> to
> > ensure backward compatibility for users.
> >
> > Unfortunately that has led to some confusion, as contributors continue
> to
> > add new code under legacy packages.
> >
> > It is also a wider issue that examples for using Apex continue to refer
> to
> > com.datatorrent packages, nearly one year after graduation. More and more
> > user code is being built on top of something that needs to change, the
> can
> > is being kicked down the road and users will face more changes later.
> >
> > I would like to propose the following:
> >
> > 1. All new code has to be submitted under org.apache.apex packages
> >
> > 2. Not all code is under backward compatibility restriction and in those
> > cases we can rename the packages right away. Examples: buffer server,
> > engine, demos/examples, benchmarks
> >
> > 3. Discuss when the core API and operators can be changed. For operators
> we
> > have a bit more freedom to do changes before a major release as most of
> > them are marked @Evolving and users have the ability to continue using
> > prior version of Malhar with newer engine due to engine backward
> > compatibility guarantee.
> >
> > Thanks,
> > Thomas
> >
>


Re: example applications in malhar

2017-02-23 Thread Sanjay Pujare
+1 for renaming to "examples". While we are at it, how about also merging
"samples" into the new "examples"?

On Thu, Feb 23, 2017 at 9:47 AM, Munagala Ramanath 
wrote:

> +1 for renaming to "examples"
>
> Ram
>
> On Thu, Feb 23, 2017 at 9:12 AM, Lakshmi Velineni  >
> wrote:
>
> > I am ready to bring the examples over into the demos folder. I was
> > wondering if anybody has any input on Thomas's suggestion to rename the
> > demos folder to examples. I would rather do that first and then bring the
> > examples over instead of doing it the other way around as that would lead
> > to refactoring the new examples again.
> >
> > Thanks
> >
> > On Wed, Jan 25, 2017 at 8:12 AM, Lakshmi Velineni <
> laks...@datatorrent.com
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > Since the examples have little history I was planning to have two
> > > commits for every example, one for the code as the primary author of
> > > the example and another containing pom.xml and other changes to make
> > > it work under malhar.
> > >
> > > Thanks
> > >
> > > On Wed, Nov 2, 2016 at 9:49 PM, Lakshmi Velineni
> > >  wrote:
> > > > Thanks for the suggestions and I am working on the process to migrate
> > the
> > > > examples with the guidelines you mentioned. I will send out a list of
> > > > examples and the destination modules very soon.
> > > >
> > > >
> > > > On Thu, Oct 27, 2016 at 1:43 PM, Thomas Weise <
> thomas.we...@gmail.com>
> > > > wrote:
> > > >>
> > > >> Maybe a good first step would be to identify which examples to bring
> > > over
> > > >> and where appropriate how to structure them in Malhar (for example,
> I
> > > see
> > > >> multiple hdfs related apps that could go into the same Maven
> module).
> > > >>
> > > >>
> > > >> On Tue, Oct 25, 2016 at 1:00 PM, Thomas Weise 
> wrote:
> > > >>
> > > >> > That would be great. There are a few things to consider when
> working
> > > on
> > > >> > it:
> > > >> >
> > > >> > * preserve attribtion
> > > >> > * ensure there is a test that runs the application in the CI
> > > >> > * check that dependencies are compatible license
> > > >> > * maybe extract common boilerplate code from pom.xml
> > > >> >
> > > >> > etc.
> > > >> >
> > > >> > Existing examples are under https://github.com/apache/
> > > >> > apex-malhar/tree/master/demos
> > > >> >
> > > >> > Perhaps we should rename it to "examples"
> > > >> >
> > > >> > I also propose that each app has a README and we add those for
> > > existing
> > > >> > apps as well.
> > > >> >
> > > >> > Thanks,
> > > >> > Thomas
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Tue, Oct 25, 2016 at 12:49 PM, Lakshmi Velineni <
> > > >> > laks...@datatorrent.com> wrote:
> > > >> >
> > > >> >>   Can i work on this?
> > > >> >>
> > > >> >> Thanks
> > > >> >> Lakshmi Prasanna
> > > >> >>
> > > >> >> On Mon, Sep 12, 2016 at 9:41 PM, Ashwin Chandra Putta <
> > > >> >> ashwinchand...@gmail.com> wrote:
> > > >> >>
> > > >> >> > Here is the JIRA:
> > > >> >> > https://issues.apache.org/jira/browse/APEXMALHAR-2233
> > > >> >> >
> > > >> >> > On Tue, Sep 6, 2016 at 10:20 PM, Amol Kekre <
> > a...@datatorrent.com>
> > > >> >> wrote:
> > > >> >> >
> > > >> >> > > Good idea to consolidate them into Malhar. We should bring in
> > as
> > > >> >> > > many
> > > >> >> > from
> > > >> >> > > this gitHub as possible.
> > > >> >> > >
> > > >> >> > > Thks
> > > >> >> > > Amol
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > On Tue, Sep 6, 2016 at 6:02 PM, Thomas Weise
> > > >> >> > > 
> > > >> >> > > wrote:
> > > >> >> > >
> > > >> >> > > > I'm also for consolidating these different example
> locations.
> > > We
> > > >> >> should
> > > >> >> > > > also look if all of it is still relevant.
> > > >> >> > > >
> > > >> >> > > > The stuff from the DT repository needs to be brought into
> > shape
> > > >> >> > > > wrt
> > > >> >> > > > licensing, checkstyle, CI support etc.
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > On Tue, Sep 6, 2016 at 4:34 PM, Pramod Immaneni <
> > > >> >> > pra...@datatorrent.com>
> > > >> >> > > > wrote:
> > > >> >> > > >
> > > >> >> > > > > Sounds like a good idea. How about merging demos with
> apps
> > as
> > > >> >> well?
> > > >> >> > > > >
> > > >> >> > > > > On Tue, Sep 6, 2016 at 4:30 PM, Ashwin Chandra Putta <
> > > >> >> > > > > ashwinchand...@gmail.com> wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Hi All,
> > > >> >> > > > > >
> > > >> >> > > > > > We have a lot of examples for apex malhar operators in
> > the
> > > >> >> > following
> > > >> >> > > > > > repository which resides outside of malhar.
> > > >> >> > > > > > https://github.com/DataTorrent/examples/tree/
> > > master/tutorials
> > > >> >> > > > > >
> > > >> >> > > > > > Now that it has grown quite a bit, does it make sense
> to
> > > >> >> > > > > > bring
> > > >> >> some
> > > >> >> > > of
> > > >> >> > > > > the
> > > >> >> > > > > > most common 

Re: APEXMALHAR-2400 In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread Sanjay Pujare
Yes, IMO it should be fixed.

On 2/15/17, 6:56 AM, "AJAY GUPTA"  wrote:

This is a bug if we consider a normal database join. All fields from both
POJOs should be emitted in the result irrespective of the name.

Ajay

On Wed, 15 Feb 2017 at 5:55 PM, Hitesh Kapoor 
wrote:

> Hi All,
>
> In the PojoInnerJoin accumulation, same-named fields are emitted as a single
> field even if we don't take a join on them. For example, consider the
> following 2 POJOs on 2 streams:
>
> POJO1
> {
>id: Int
>age : String
> }
>
> POJO2
> {
>  id: Int
>  age : String
>  name : String
> }
>
> If we wish to take a join only on field id, then the resulting stream
> contains the common-named field (age) only from POJO2.
> So I am confused whether the resulting stream should contain the field
> 'age' from only POJO1 (or only POJO2) or it should contain the field 'age'
> from both the POJOs.
>
> I think it is a bug which should be fixed and the resulting stream should
> contain common named field from both the POJOs (and maybe rename it in the
> final output). Let me know your thoughts on it.
>
> Regards,
> Hitesh
>
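
For illustration, one possible resolution (an assumption about the fix, not
current behavior) would keep both same-named fields and disambiguate them:

// Hypothetical output POJO for a join of POJO1 and POJO2 on "id"
public class JoinedPojo
{
  public int id;       // join key
  public String age1;  // age from POJO1
  public String age2;  // age from POJO2
  public String name;  // only present in POJO2
}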





Re: At-least once semantics for Kafka to Cassndra ingest

2017-02-14 Thread Sanjay Pujare
Have you taken a look at
http://apex.apache.org/docs/apex/application_development/#exactly-once ?
I.e., setting that processing mode on all the operators in the pipeline.
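
In code, that attribute is set per operator while composing the DAG, roughly
like this sketch (the operator variable names are illustrative):

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.Operator.ProcessingMode;

// inside populateDAG(DAG dag, Configuration conf), for each operator:
dag.setAttribute(kafkaInput, OperatorContext.PROCESSING_MODE,
    ProcessingMode.EXACTLY_ONCE);
dag.setAttribute(cassandraOutput, OperatorContext.PROCESSING_MODE,
    ProcessingMode.EXACTLY_ONCE);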

Join us at Apex Big Data World-San Jose, April 4, 2017!
http://www.apexbigdata.com/san-jose-register.html
 

On 2/14/17, 12:00 PM, "Himanshu Bari"  wrote:

How to ensure that the Kafka to Cassandra ingestion pipeline in Apex will
guarantee exactly once processing semantics.
Eg. Message was read from Kafka but apex app died before it was written
successfully to Cassandra.





Re: [DISCUSS] Policy for patches

2017-01-27 Thread Sanjay Pujare
A strong +1 for the second approach for the reasons Pramod mentioned.

Is it also possible to “prune” branches so that we have less of this activity 
of merging fixes across branches? If we can ascertain that a certain branch is 
not used by any user/customer (by asking in the community), we should be able to 
remove it. For example, apex-malhar has release-3.6, which is definitely 
required, but 3-year-old branches like release-0.8.5, release-0.9.0, … telecom 
most probably are not being used by anybody.

On 1/27/17, 8:43 AM, "Pramod Immaneni"  wrote:

Hi,

I wanted to bring up the topic of patches for issues discovered in older
releases and start a discussion to come up with a policy on how to apply
them.

One approach is the patch gets only applied to the release it was
discovered in and master. Another approach is it gets applied to all
release branches >= discovered release and master. There may be other
approaches as well which can come up in this discussion.

The advantage of the first approach is that the immediate work is limited
to a single fix and merge. The second approach requires more work initially
as the patch needs to get applied to more one or more places.

I am tending towards the second approach of applying the fix to all release
branches >= discovered release, while also having some sort of an end of
life policy for releases otherwise it might become difficult to manage the
fixes. The end-of-life policy would mean that beyond a time period
(something reasonable between 1 - 3 years) after the release is made,
the release branch becomes frozen and no more fixes are applied to it.
Consider the following two problematic scenarios if we apply the fix only
to the one discovered release.

   - Let's say tomorrow a new fix needs to be made to a newer release
   branch (which had existed at the time of the fix) that is dependent on the
   current fix. Before applying the new fix one would need to first know to
   cherry-pick the old fix (may not be trivial to know this) and second and
   more important there could be merge conflicts when cherry-picking. At this
   point, you may be trying to resolve the conflicts where you may not be the
   primary author of the fix and/or the knowledge of the fix is no longer
   fresh in folks minds. This is prone to errors.


   - Let's call this fix a. Let's say there was another fix b. requiring a.
   to work correctly also on the same release branch but no compile
   dependencies on a. If in future we have a fix c. that depends on b. that
   needs to be applied on a different release branch both b. and a. would need
   to be cherry picked. It might be easy to figure out b. is needed, as c. has
   a direct dependency to it but would not be easy to determine that a. is
   also needed since that is old knowledge. It will not be easy to catch this
   omission because of no compile errors if a. is not included.

Thanks





Re: [DISCUSS] Stram events log levels

2017-01-19 Thread Sanjay Pujare
Strong +1 for the reasons cited by Ajay and Hitesh.

Do we really need to modify the EventInfo class? Wouldn’t
BeanUtils.describe(event) automatically populate the EventLogLevel into “data”
(a Map) as a property? All the downstream code, including the
REST API, will then just return the EventLogLevel as part of the JSONObject
without making any changes, right?

Another thing: these levels should ideally match the corresponding entries in
the dt.log file, as some users might expect.

On 1/19/17, 10:10 AM, "Hitesh Kapoor"  wrote:

+1, it will enhance the user experience and the approach suggested by Ajay
doesn't seem to add significant overhead. Also I feel that in the future it
can be enhanced further and the UI can also filter the event logs based on
these log levels.

On Jan 19, 2017 10:57 PM, "AJAY GUPTA"  wrote:

> Current event log mechanism: The events are currently recorded in the
> datatorrent/apps/{appid}/events/ folder, in files named part{id}.txt, e.g.
> datatorrent/apps/application_1483801541352_0003/events/part0.txt
>
> We can access these events via a REST API which returns the list of events
> as List<EventInfo> objects.
>
>
> The below approach can be used for implementing this. Kindly let me know
> your viewpoints/suggestions on the approach.
>
> *Required Changes:*
>
> a) Add a enum field like EventLogLevel in StramEvent, assign the log level
> in constructors of different Stram events like StartOperator,
> StartContainer, StopOperator, etc.
>
> b) Add eventLogLevel in the EventInfo class. The events returned via the
> REST API from StramClient are a list of EventInfo objects.
>
> c) Changes in EventsAgent - processPartFile() method to parse the
> eventLogLevel logged into the part{id}.txt file
>
>
> On Thu, Jan 19, 2017 at 10:55 PM, AJAY GUPTA  wrote:
>
> > Hi Apex community,
> >
> >
> > Currently, events logged by stram such as StartContainer, StartOperator,
> > StopContainer, StopOperator, etc dont have any associated priority 
level.
> >
> > We can provide log levels for such Stram events. Log levels can be INFO,
> > WARN, ERROR
> > Eg:
> > 1. Start Container, Start Operator are INFO level events
> > 2. OperatorError is ERROR level event
> > 3. Stop Container, Stop Operator are WARN level events
> >
> > Log level for events can help in user experience when showing the event
> > list in a GUI. eg: In datatorrent gateway UI, we can provide color-coded
> > log levels so that user can focus more on ERROR and WARN events.
> >
> >
> >
> >
> >
>
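
For concreteness, change (a) that Ajay outlines might look roughly like this
sketch (names are illustrative, not committed API):

public abstract class StramEvent
{
  public enum EventLogLevel { INFO, WARN, ERROR }

  private final EventLogLevel eventLogLevel;

  protected StramEvent(EventLogLevel eventLogLevel)
  {
    this.eventLogLevel = eventLogLevel;
  }

  public EventLogLevel getEventLogLevel()
  {
    return eventLogLevel;  // picked up by BeanUtils.describe() as a bean property
  }
}

// a concrete event then chooses its level in the constructor, e.g.:
class StartOperatorEvent extends StramEvent
{
  StartOperatorEvent()
  {
    super(EventLogLevel.INFO);
  }
}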





Re: Upgrade Apache Bigtop to Apex Core 3.5.0

2017-01-19 Thread Sanjay Pujare
+1 assuming not a lot of work is involved.

What does it take to add a mention of Apex in http://bigtop.apache.org/ ?

On 1/19/17, 8:12 AM, "Pramod Immaneni"  wrote:

+1

On Thu, Jan 19, 2017 at 12:53 AM, Chinmay Kolhatkar  wrote:

> Dear Community,
>
> Now that Apex core 3.5.0 is released, is it good time to upgrade Apache
> Bigtop for Apex to 3.5.0?
>
> -Chinmay.
>





Re: Shutdown of an Apex app

2017-01-17 Thread Sanjay Pujare
I have similar questions/concerns. When a downstream operator initiates a
shutdown, how is that trigger signaled to and coordinated with the input
operators stopping their input reading activity? Or do idempotency etc. not
matter after shutdown?



On 1/17/17, 10:31 PM, "Chinmay Kolhatkar"  wrote:

+1 for the feature in general.

Though I have a question about it: Let's say the 10th operator in the DAG
initiates the shutdown. That operator, being quite far downstream of the
input operator, might be fairly behind in the window it is processing.
What happens to the data already processed by upstream operators?

Is there a possibility where Idempotency might get affected because
termination action is taken in downstream operators?

-Chinmay.


On Wed, Jan 18, 2017 at 11:55 AM, Tushar Gosavi 
wrote:

> I think this would be a great addition for batch use cases or use
> cases where the DAG needs to be shut down after detecting some
> completion/error condition through the operator. We have one Jira
> Opened for such functionality
> https://issues.apache.org/jira/browse/APEXCORE-503.
>
> - Tushar.
>
>
> On Wed, Jan 18, 2017 at 11:45 AM, Bhupesh Chawda
>  wrote:
> > Hi All,
> >
> > Currently we can shutdown an Apex app in the following ways:
> > 1. Throw ShutdownException() from *all* the input operators
> > 2. Use Apex CLI to shutdown an app using the YARN App Id
> >
> > I think we should have some way of shutting down an application from
> within
> > an operator. It is not always true that the trigger for shutdown is sent
> by
> > the input operator only. Sometimes, an end condition may be detected by
> > some operator in the DAG which wants the processing to end. Such a
> > shutdown, although triggered from some intermediate operator in the DAG,
> > should guarantee graceful shut down of the application.
> >
> > Thoughts?
> >
> > ~ Bhupesh
>
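
For reference, today's mechanism looks like this sketch: an input operator
ends its part of the application by throwing ShutdownException from
emitTuples() once it detects the end condition (class and field names are
illustrative):

import com.datatorrent.api.InputOperator;
import com.datatorrent.api.Operator.ShutdownException;
import com.datatorrent.common.util.BaseOperator;

public class FiniteSource extends BaseOperator implements InputOperator
{
  private boolean done;

  @Override
  public void emitTuples()
  {
    if (done) {
      // the engine stops this input operator; the app shuts down once all
      // input operators have done the same
      throw new ShutdownException();
    }
    // ... emit tuples and eventually set done = true ...
  }
}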





Re: [DRAFT] [REPORT] Apache Apex - January 2017

2017-01-05 Thread Sanjay Pujare
About the following sentence:

➢ We continue to see increase in interest and adaption

Shouldn’t it be “adoption” ?


On 1/5/17, 7:47 PM, "Thomas Weise"  wrote:

Below is the draft board report for feedback.

I'm planning to submit it on Monday.

Thanks,
Thomas


## Description:

Apache Apex is a distributed, large-scale, high throughput,
low-latency, fault tolerant, unified stream and batch processing
platform.

## Issues:

There are no issues that require the Board's attention.

## Status/Activity:

In November the community released 3.6.0 of Apex Malhar library, which
added the initial version of SQL support using Apache Calcite,
improvements to windowed state management performance and scalability,
expanded user documentation and many other improvements.

In December followed the 3.5.0 release of Apex Core, which is the stream
processing engine. The previous release was 3.4.0 in May and releases are
generally less frequent than the library. The release contains several
operability improvements and also moves the Apache Hadoop dependency from
2.2 to 2.6, enabling the engine to take advantage of newer features in
Apache Hadoop YARN.

Apex is now also present in Apache Beam; the first version of the
Apex runner is part of the recent 0.4.0 Beam release.

Another integration with Apache SAMOA (incubating) was also completed
recently. Apache SAMOA is an open source platform for mining big data
streams and allows multiple Distributed Stream Processing Engines (DSPEs)
to be integrated into the framework.

Apex was present at Apache Big Data in Seville, at Big Data Spain in Madrid
and also as part of the Apache Beam tutorial at Strata Singapore.
We continue to see increase in interest and adaption, more info can be found
on the Powered By page on the project web site.

## Community:

- Currently 15 PMC members.
- No new PMC member added in the last 3 months
- Last PMC addition was Chandni Singh 2016-09-07

- Currently 40 committers.
- No new committer added in the last 3 months
- Last committer addition was Devendra Tagare on 2016-08-10

- Contributors: 67 all time, 45 in last 12 months

## Releases:

Following are the most recent releases:

- Malhar 3.6.0 released 2016-11-26
- Malhar 3.5.0 released 2016-09-05
- Core 3.5.0 released 2016-12-06





Re: Visitor API for DAG

2016-12-08 Thread Sanjay Pujare
Thinking more about it, it makes sense to have generic hooks for post-launch 
use cases so I support the idea of making it generic.

On 12/7/16, 10:09 AM, "Sanjay Pujare" <san...@datatorrent.com> wrote:

Should we continue the discussion in the JIRA?

Making it generic the way you are suggesting – won’t it cause problems of
consistency, e.g. different hooks doing different things? Also, are there use
cases for such generic hooks to justify the effort involved?

On 12/7/16, 7:29 AM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:

Tushar,

Why specifically limit it to client side preparation of the DAG before 
the
application is launched? Why not make it possible to have general hooks
that can apply even when the application is running in the different
distributed components of the application such as the containers and 
stram.
The hooks could be registered to be asynchronous or synchronous. The
asynchronous ones could be handled via a bus, we already have a bus
called mbassador that we use today and it could potentially be used for
this. The entire functionality is akin to something like dtrace without 
the
dynamic part.

Thanks

On Fri, Nov 25, 2016 at 3:24 AM, Tushar Gosavi <tus...@datatorrent.com>
wrote:

> Opened a Jira https://issues.apache.org/jira/browse/APEXCORE-577 for 
this.
>
> - Tushar.
>
>
> On Mon, Nov 21, 2016 at 9:59 PM, Amol Kekre <a...@datatorrent.com> 
wrote:
> > Ananth,
> > The current API allows changing properties of a running app. The new
> > proposed API is not needed to do so.
> >
> > Thks
> > Amol
> >
> >
> > On Sun, Nov 20, 2016 at 10:43 PM, Tushar Gosavi 
<tus...@datatorrent.com>
> > wrote:
> >
> >> Hi Ananth,
> >>
> >> We can not change runtime properties through this API. The current
> >> flow of apex application execution is
> >>
> >> 1) StramClient prepares application, inject properties from
> >> properties.xml / user provided xml files and validates dag
> >> 2) StramClient copies required jars and serialized plan to HDFS and
> >> launch master container.
> >> 3) Application master reads serialised plan from HDFS and starts
> >> deploying StramClient as per deployment plan.
> >>
> >> The visitor will examine the DAG in stage 1, hence visitor can only
> >> change the initial state. The execution of the application is not
> >> affected.
> >>
> >> - Tushar.
> >>
> >>
> >> On Sat, Nov 19, 2016 at 3:40 AM, ananth <ananthg.a...@gmail.com> 
wrote:
> >> > How does this work for the stateful operators ? Can we use this 
to
> >> override
> >> > properties that are deserialized ?
> >> >
> >> > Regards,
> >> >
> >> > Ananth
> >> >
> >> >
> >> >
> >> > On 18/11/16 05:53, Tushar Gosavi wrote:
> >> >>
> >> >> The code will execute before application master is launched, it 
is
> >> >> just one time activity during application startup. Few use 
cases I
> >> >> could think are
> >> >>
> >> >> - Operator validation/configuration validator
> >> >>jdbc operator could check if database is accessible with 
given
> >> >> credentials.
> >> >>file output operator could check if the directory exists and
> >> >> filesystem is writable.
> >> >>
> >> >> - Injection of properties in operators from external sources.
> >> >>
> >> >> - If two operators want to exchange some information based on
> >> >> configuration, they could do it through the visitor. For example
> >> >> TUPLE_SCHEMA can be set on downstream operator port based on
> operators
> >> >> input TUPLE_SCHEMA and its configuration (for example projection
> >> >> operator which drops few columns, could create a new class with 
fewer

Re: Visitor API for DAG

2016-12-07 Thread Sanjay Pujare
Should we continue the discussion in the JIRA?

Making it generic the way you are suggesting – won’t it cause problems of
consistency, e.g. different hooks doing different things? Also, are there use
cases for such generic hooks to justify the effort involved?
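
For reference, a purely hypothetical shape of the client-side visitor being
discussed in APEXCORE-577 (no such interface exists in Apex today):

public interface DAGVisitor
{
  // called for each operator while StramClient prepares the DAG (stage 1)
  void visitOperator(DAG dag, String operatorName, Operator operator);

  // called for each stream defined in the DAG
  void visitStream(DAG dag, String streamName);
}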

On 12/7/16, 7:29 AM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:

Tushar,

Why specifically limit it to client side preparation of the DAG before the
application is launched? Why not make it possible to have general hooks
that can apply even when the application is running in the different
distributed components of the application such as the containers and stram.
The hooks could be registered to be asynchronous or synchronous. The
asynchronous ones could be handled via a bus, we already have a bus
called mbassador that we use today and it could potentially be used for
this. The entire functionality is akin to something like dtrace without the
dynamic part.

Thanks

On Fri, Nov 25, 2016 at 3:24 AM, Tushar Gosavi <tus...@datatorrent.com>
wrote:

> Opened a Jira https://issues.apache.org/jira/browse/APEXCORE-577 for this.
>
> - Tushar.
>
>
> On Mon, Nov 21, 2016 at 9:59 PM, Amol Kekre <a...@datatorrent.com> wrote:
> > Ananth,
> > The current API allows changing properties of a running app. The new
> > proposed API is not needed to do so.
> >
> > Thks
> > Amol
> >
> >
> > On Sun, Nov 20, 2016 at 10:43 PM, Tushar Gosavi <tus...@datatorrent.com>
> > wrote:
> >
> >> Hi Ananth,
> >>
> >> We can not change runtime properties through this API. The current
> >> flow of apex application execution is
> >>
> >> 1) StramClient prepares application, inject properties from
> >> properties.xml / user provided xml files and validates dag
> >> 2) StramClient copies required jars and serialized plan to HDFS and
> >> launch master container.
> >> 3) Application master reads serialised plan from HDFS and starts
> >> deploying StramClient as per deployment plan.
> >>
> >> The visitor will examine the DAG in stage 1, hence visitor can only
> >> change the initial state. The execution of the application is not
> >> affected.
> >>
> >> - Tushar.
> >>
> >>
> >> On Sat, Nov 19, 2016 at 3:40 AM, ananth <ananthg.a...@gmail.com> wrote:
> >> > How does this work for the stateful operators ? Can we use this to
> >> override
> >> > properties that are deserialized ?
> >> >
> >> > Regards,
> >> >
> >> > Ananth
> >> >
> >> >
> >> >
> >> > On 18/11/16 05:53, Tushar Gosavi wrote:
> >> >>
> >> >> The code will execute before application master is launched, it is
> >> >> just one time activity during application startup. Few use cases I
> >> >> could think are
> >> >>
> >> >> - Operator validation/configuration validator
> >> >>jdbc operator could check if database is accessible with given
> >> >> credentials.
> >> >>file output operator could check if the directory exists and filesystem is
> >> >> writable.
> >> >>
> >> >> - Injection of properties in operators from external sources.
> >> >>
> >> >> - If two operators want to exchange some information based on
> >> >> configuration, they could do it through the visitor. For example
> >> >> TUPLE_SCHEMA can be set on downstream operator port based on
> operators
> >> >> input TUPLE_SCHEMA and its configuration (for example projection
> >> >> operator which drops few columns, could create a new class with 
fewer
> >> >> fields and set it as tuple schema on downstream operator port).
> >> >>
> >> >> - For a pojo enabled operator (port where TUPLE_SCHEMA is defined), an
> >> >> efficient stream codec could be written using the asm library for
> >> >> serialisation and used as the stream codec instead of the default one.
> >> >>
> >> >> -Tushar.
> >> >>
> >> >>
> >> >> On Thu, Nov 17, 2016 at 11:35 PM, Sanjay Pujare <
> san...@da

Re: [GitHub] apex-core pull request #426: APEXCORE-558 Change highlight color to red and ...

2016-12-05 Thread Sanjay Pujare
The build triggered by this PR failed with this weird error (see
https://builds.apache.org/job/Apex_Core_PR/204/console):

 > git clean -fdx # timeout=10
Parsing POMs
Modules changed, recalculating dependency graph
Build timed out (after 20 minutes). Marking the build as failed.
Build was aborted
Putting comment on the pull request
Finished: FAILURE

Does anyone know what the problem is? How does one fix it?

On 12/2/16, 2:19 PM, "sanjaypujare" <g...@git.apache.org> wrote:

    GitHub user sanjaypujare opened a pull request:

        https://github.com/apache/apex-core/pull/426

        APEXCORE-558 Change highlight color to red and implement quit command

        @amberarrow could you review and merge pls?

    You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/sanjaypujare/apex-core APEXCORE-558.sanjay

    Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/apex-core/pull/426.patch

    To close this pull request, make a commit to your master/trunk branch
    with (at least) the following in the commit message:

        This closes #426

    commit 2f3bf30e34fe24367663530133e20b3626784878
    Author: Sanjay Pujare <san...@datatorrent.com>
    Date:   2016-12-02T22:17:24Z

        APEXCORE-558 Change highlight color to red and implement quit command

    ---
    If your project is set up for it, you can reply to this email and have your
    reply appear on GitHub as well. If your project does not have this feature
    enabled and wishes so, or if the feature is enabled but not working, please
    contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
    with INFRA.
    ---


Re: "ExcludeNodes" for an Apex application

2016-11-30 Thread Sanjay Pujare
Yes, Ram explained to me that in practice this would be a useful feature for 
Apex devops who typically have no control over Hadoop/Yarn cluster.
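
For reference, the automatic blacklisting Sandesh mentions below is enabled
through existing DAG attributes, roughly like this sketch (the values, and the
attribute types of Integer and Long, are assumptions):

import com.datatorrent.api.Context;

// while composing the DAG:
dag.setAttribute(Context.DAGContext.MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST, 3);
dag.setAttribute(Context.DAGContext.BLACKLISTED_NODE_REMOVAL_TIME_MILLIS, 60 * 60 * 1000L);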

On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com> wrote:

This is a practical scenario where developers would be required to exclude
certain nodes as they might be required for some mission critical
applications. It would be good to have this feature.

I understand that Stram should not get into resourcing and still rely on
Yarn, however, as the App Master it should have the right to reject the
nodes offered by Yarn and request for other resources.

Regards,
Mohit

On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <sand...@datatorrent.com>
wrote:

> Apex has automatic blacklisting of the troublesome nodes, please take a
> look at the following attributes,
>
> MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> https://www.datatorrent.com/docs/apidocs/com/datatorrent/
> api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_
> FAILURES_FOR_BLACKLIST
>
> BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
>
> Thanks
>
>
>
> On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <r...@datatorrent.com>
> wrote:
>
> Not sure if this is what Milind had in mind but we often run into
> situations where the dev group
> working with Apex has no control over cluster configuration -- to make any
> changes to the cluster they need to
> go through an elaborate process that can take many days.
>
> Meanwhile, if they notice that a particular node is consistently causing
> problems for their
> app, having a simple way to exclude it would be very helpful since it 
gives
> them a way
> to bypass communication and process issues within their own organization.
>
> Ram
>
> On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <san...@datatorrent.com>
> wrote:
>
> > To me both use cases appear to be generic resource management use cases.
> > For example, a randomly rebooting node is not good for any purpose esp.
> > long running apps so it is a bit of a stretch to imagine that these 
nodes
> > will be acceptable for some batch jobs in Yarn. So such a node should be
> > marked “Bad” or Unavailable in Yarn itself.
> >
> > Second use case is also typical anti-affinity use case which ideally
> > should be implemented in Yarn – Milind’s example can also apply to
> non-Apex
> > batch jobs. In any case it looks like Yarn still doesn’t have it (
> > https://issues.apache.org/jira/browse/YARN-1042) so if Apex needs it we
> > will need to do it ourselves.
> >
> > On 11/30/16, 10:39 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:
> >
> > But then, what's the solution to the 2 problem scenarios that Milind
> > describes ?
> >
> > Ram
> >
> > On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <
> > san...@datatorrent.com>
> > wrote:
> >
> > > I think “exclude nodes” and such is really the job of the resource
> > manager
> > > i.e. Yarn. So I am not sure taking over some of these tasks in 
Apex
> > would
> > > be very useful.
> > >
> > > I agree with Amol that apps should be node neutral. Resource
> > management in
> > > Yarn together with fault tolerance in Apex should minimize the 
need
> > for
> > > this feature although I am sure one can find use cases.
> > >
> > >
> > > On 11/29/16, 10:41 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
> > >
> > > We do have this feature in Yarn, but that applies to all
> > applications.
> > > I am
> > > not sure if Yarn has anti-affinity. This feature may be used,
> > but in
> > > general there is danger in an application taking over resource
> > > allocation.
> > > Another quirk is that big data apps should ideally be
> > node-neutral.
> > > This is
> > > a good idea, if we are able to carve out something where need
> is
> > app
> > > specific.
> > >
> > > Thks
> > > Amol
> > >
> > >
> > > On Tue, Nov 29, 

Re: "ExcludeNodes" for an Apex application

2016-11-30 Thread Sanjay Pujare
To me both use cases appear to be generic resource management use cases. For 
example, a randomly rebooting node is not good for any purpose esp. long 
running apps so it is a bit of a stretch to imagine that these nodes will be 
acceptable for some batch jobs in Yarn. So such a node should be marked “Bad” 
or Unavailable in Yarn itself.

Second use case is also typical anti-affinity use case which ideally should be 
implemented in Yarn – Milind’s example can also apply to non-Apex batch jobs. 
In any case it looks like Yarn still doesn’t have it 
(https://issues.apache.org/jira/browse/YARN-1042) so if Apex needs it we will 
need to do it ourselves.

On 11/30/16, 10:39 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:

But then, what's the solution to the 2 problem scenarios that Milind
describes ?

Ram

On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> I think “exclude nodes” and such is really the job of the resource manager
> i.e. Yarn. So I am not sure taking over some of these tasks in Apex would
> be very useful.
>
> I agree with Amol that apps should be node neutral. Resource management in
> Yarn together with fault tolerance in Apex should minimize the need for
> this feature although I am sure one can find use cases.
>
>
> On 11/29/16, 10:41 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
>
> We do have this feature in Yarn, but that applies to all applications.
> I am
> not sure if Yarn has anti-affinity. This feature may be used, but in
> general there is danger in an application taking over resource
> allocation.
> Another quirk is that big data apps should ideally be node-neutral.
> This is
> a good idea, if we are able to carve out something where need is app
> specific.
>
> Thks
> Amol
>
>
> On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <mili...@gmail.com>
> wrote:
>
> > We have seen 2 cases mentioned below, where, it would have been nice
> if
> > Apex allowed us to exclude a node from the cluster for an
> application.
> >
> > 1. A node in the cluster had gone bad (was randomly rebooting) and
> so an
> > Apex app should not use it - other apps can use it as they were
> batch jobs.
> > 2. A node is being used for a mission critical app (Could be an Apex
> app
> > itself), but another Apex app which is mission critical should not
> be using
> > resources on that node.
> >
> > Can we have a way in which, Stram and YARN can coordinate between
> > each other to not use a set of nodes for the application. It can be
> > done in 2 ways -
> >
> > 1. Have a list of "exclude" nodes with Stram - when YARN allocates
> > resources on either of these, STRAM rejects and gets resources
> > allocated again from YARN
> > 2. Have a list of nodes that can be used for an app - This can be a
> > part of config. However, I don't think this would be a right way to
> > do so as we will need support from YARN as well. Further, this might
> > be difficult to change at runtime if need be.
> >
> > Any thoughts?
> >
> >
> > --
> > ~Milind bee at gee mail dot com
> >
>
>
>
>





Re: "ExcludeNodes" for an Apex application

2016-11-30 Thread Sanjay Pujare
I think “exclude nodes” and such is really the job of the resource manager i.e. 
Yarn. So I am not sure taking over some of these tasks in Apex would be very 
useful.

I agree with Amol that apps should be node neutral. Resource management in Yarn 
together with fault tolerance in Apex should minimize the need for this feature 
although I am sure one can find use cases.


On 11/29/16, 10:41 PM, "Amol Kekre"  wrote:

We do have this feature in Yarn, but that applies to all applications. I am
not sure if Yarn has anti-affinity. This feature may be used, but in
general there is danger in an application taking over resource allocation.
Another quirk is that big data apps should ideally be node-neutral. This is
a good idea, if we are able to carve out something where need is app
specific.

Thks
Amol


On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve  wrote:

> We have seen 2 cases mentioned below, where, it would have been nice if
> Apex allowed us to exclude a node from the cluster for an application.
>
> 1. A node in the cluster had gone bad (was randomly rebooting) and so an
> Apex app should not use it - other apps can use it as they were batch 
jobs.
> 2. A node is being used for a mission critical app (Could be an Apex app
> itself), but another Apex app which is mission critical should not be 
using
> resources on that node.
>
> Can we have a way in which, Stram and YARN can coordinate between each
> other to not use a set of nodes for the application. It can be done in 2
> ways -
>
> 1. Have a list of "exclude" nodes with Stram - when YARN allocates resources
> on either of these, STRAM rejects and gets resources allocated again from
> YARN
> 2. Have a list of nodes that can be used for an app - This can be a part of
> config. However, I don't think this would be a right way to do so as we will
> need support from YARN as well. Further, this might be difficult to change
> at runtime if need be.
>
> Any thoughts?
>
>
> --
> ~Milind bee at gee mail dot com
>





Re: Apex internal documentation.

2016-11-29 Thread Sanjay Pujare
+1.

Another thing we should look into is comments/code documentation in apex-core 
and apex-malhar. I think it can be significantly enhanced.

On 11/28/16, 9:53 PM, "Tushar Gosavi"  wrote:

Do we have documents explaining the internals of Apex? Would it be a good
idea to add a document explaining the following on apex.apache.org to help
newcomers understand the overall execution flow of an application and Apex
capabilities? This might also help them add new features.

- Startup of application
  - logical plan formation and properties injection.
  - copying of jars/resources
  - handling of old application state.
- Master initialization
  - converting logical plan to physical plan
  - container assignments with locality and affinity
  - initial checkpoints
  - deployment
  - restarting from killed application (-originalId)
- StramChild functionality
  - Operator deploy requests and operator lifecycle management.
  - handling of checkpoints
  - WindowGenerator
  - Sending heartbeats
  - handling of various commands from Stram.
  - Operator request response model.
  - Bufferserver management. (purge/reset/disk spooling/back-pressure 
handling)
- Run-time activities at master.
  - heartbeat monitoring
  - stats listeners
  - recovery from operator failure
  - dynamic plan change

- Tushar.





Re: [jira] [Comment Edited] (APEXCORE-576) Apex/Malhar Extensions

2016-11-28 Thread Sanjay Pujare
Ananth: can you add these comments to the JIRA 
(https://issues.apache.org/jira/browse/APEXCORE-576) instead of following up on 
the email thread? Thanks



On 11/28/16, 2:08 PM, "Ananth G"  wrote:

If we were to model something along the lines of "Spark packages", I am
afraid that the story of Apex being a "rich ecosystem" might be diluted.

Today I see Malhar as a strong reason to convince someone that we can
interconnect many things as part of the offering. The moment we put in a
layered process, my only concern is that the review process would be very
loose, and hence lower quality could creep in, as Apex committers need not
necessarily review such packages and hence provide inputs to the evolving
Apex core etc. There is a risk that things might diverge more than we want
them to?


Thoughts ?


Regards,
Ananth

On Tue, Nov 29, 2016 at 6:54 AM, Sanjay M Pujare (JIRA) 
wrote:

>
> [ https://issues.apache.org/jira/browse/APEXCORE-576?page=
> com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel=15702868#comment-15702868 ]
>
> Sanjay M Pujare edited comment on APEXCORE-576 at 11/28/16 7:54 PM:
> 
>
> As far as I understand then the main enhancement here is the "registry"
> and a process to pick items from the registry to merge into the main
> repository and we will be using the current git process of pull-requests
> and merging across forks. Are there existing models we can look at for the
> registry?
>
> A few things that need to be hammered out:
> - registry access control, and format
> - criteria for merging external contributions into the main repo (quality,
> unit tests, demand, licensing)
> - level of automation available, or otherwise the manual process used to
> merge contributions
> - dependency matrix management (e.g. an item depends on malhar 3.5.0
> minimum)
> - guidelines for adding items to the registry to be followed by
> contributors
>
>
> was (Author: sanjaypujare):
> As far as I understand then the main enhancement here is the "registry"
> and a process to pick items from the registry to merge into the main
> repository and we will be using the current git process of pull-requests
> and merging across forks. Are there existing models we can look at for the
> registry?
>
> A few things that need to be hammered out:
> - registry access control, and format
> - criteria for merging external contributions into the main repo (quality,
> unit tests, demand, licensing)
> - level of automation available, or otherwise the manual process used to
> merge contributions
> - dependency matrix management (e.g. an item depends malhar 3.5.0 minimum)
> - guidelines for adding items to the registry to be followed by
> contributors
>
> > Apex/Malhar Extensions
> > --
> >
> > Key: APEXCORE-576
> > URL: https://issues.apache.org/jira/browse/APEXCORE-576
> > Project: Apache Apex Core
> >  Issue Type: New Feature
> >  Components: Website
> >Reporter: Chinmay Kolhatkar
> >
> > The purpose of this task is to provide a way to external contributors to
> make better contributions to Apache Apex project.
> > The idea looks something like this:
> > 1. One could have extension to apex core/malhar in their own repository
> and just register itself with Apache Apex.
> > 2. As it matures and find more and more use we can consume that in
> mainstream releases.
> > 3. Some possibilities of of Apex extensions are as follows:
> > a. Operators - DataSources, DataDestinations, Parsers, Formatters,
> Processing etc.
> > b. Apex Engine Plugables
> > c. External Integrations
> > d. Integration with other platform like Machine learning, Graph
> engines etc.
> > e. Application which are ready to use.
> > d. Apex related tools which can ease the development and usage of
> apex.
> > The initial discussion about this has happened here:
> > https://lists.apache.org/thread.html/3d6ca2b46c53df77f37f54d64e1860
> 7a623c5a54f439e1afebfbef35@%3Cdev.apex.apache.org%3E
> > More concrete discussion/implementation proposal required for this task
> to progress.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>





Re: Proposing a new feature to persist logical and physical plan snapshots in HDFS

2016-11-23 Thread Sanjay Pujare
Okay, but this “state” is gone after the app is “dead”, isn’t that true? Also,
the reason for this enhancement is debuggability/troubleshooting of Apex apps,
so it is good to have separate, explicit, user-visible files that contain the
plan information instead of overloading the state for this purpose (in my
opinion).

In terms of on-demand, it sounds like a good idea - I didn’t think of it. But I
would like to drill down into the use cases. In most cases, logical/physical
plan changes are spontaneous or rather internal to the app, so an external
entity making a REST call to save the plan on demand might not sync up with
when the plan changes took place inside the app. So saving the plan JSON files
on the events described previously seems to be the most efficient thing to do
(as discussed with @Ashwin Putta), but if there are use cases I think it is a
good idea to do it on demand as well.

On 11/23/16, 3:00 PM, "Amol Kekre" <a...@datatorrent.com> wrote:

Good idea. Stram does save state, and maybe a script that translates may
work. But explicit plan saving is also a good idea. Could this be "on
demand"? A REST call that writes out the plan(s) to specified HDFS files?

We could do both (write on any change/set call) and/or on-demand.

Thks
Amol


On Wed, Nov 23, 2016 at 2:40 PM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> To help Apex developers/users with debugging or troubleshooting “dead”
> applications, I am proposing a new feature to persist logical and physical
> plan snapshots in HDFS.
>
>
>
> Similar to how the Apex engine persists container data per application
> attempt in HDFS as containers_NNN.json (where NNN is 1 for first app
> attempt, 2 for the second app attempt and so on), we will create 2 more
> sets of files under the …/apps/{appId} directory for an application:
>
>
>
> logicalPlan_NNN_MMM.json
>
> physicalPlan_NNN_MMM.json
>
>
>
> where NNN stands for the app attempt index (similar to NNN above 1, 2, 3
> and so on) and MMM is a running index starting at 1 which stands for a
> snapshot within an app attempt. Note that a logical or physical plan may
> change within an app attempt for any number of reasons.
>
>
>
> The StreamingContainerManager class maintains the current logical/physical
> plans in the “plan” member variable. New methods will be added in
> StreamingContainerManager to save the logical or physical plan as JSON
> representations in the app directory (as described above). The logic is
> similar to 
com.datatorrent.stram.webapp.StramWebServices.getLogicalPlan(String)
> and com.datatorrent.stram.webapp.StramWebServices.getPhysicalPlan() used
> inside the Stram Web service. There will be running indexes in
> StreamingContainerManager to keep track of MMM for the logical plan and
> physical plan. The appropriate save method will be called on the 
occurrence
> of any event that updates the logical or physical plan for example:
>
>
>
> inside com.datatorrent.stram.StreamingContainerManager.
> LogicalPlanChangeRunnable.call()  for logical plan change event
>
>
>
> inside 
com.datatorrent.stram.plan.physical.PhysicalPlan.redoPartitions(PMapping,
> String) for physical plan change event (i.e. redoing partitioning)
>
>
>
> Once these files are created, any user or a tool (such as the Apex CLI or
> the DT Gateway) can look up these files for troubleshooting/researching of
> “dead” applications and significant events in their lifetime in terms of
> logical or physical plan changes. Pls send me your feedback.
>
>
>
> Sanjay
>
>
>
>





Proposing a new feature to persist logical and physical plan snapshots in HDFS

2016-11-23 Thread Sanjay Pujare
To help Apex developers/users with debugging or troubleshooting “dead” 
applications, I am proposing a new feature to persist logical and physical plan 
snapshots in HDFS.

 

Similar to how the Apex engine persists container data per application attempt 
in HDFS as containers_NNN.json (where NNN is 1 for first app attempt, 2 for the 
second app attempt and so on), we will create 2 more sets of files under the 
…/apps/{appId} directory for an application:

 

logicalPlan_NNN_MMM.json

physicalPlan_NNN_MMM.json

 

where NNN stands for the app attempt index (similar to NNN above 1, 2, 3 and so 
on) and MMM is a running index starting at 1 which stands for a snapshot within 
an app attempt. Note that a logical or physical plan may change within an app 
attempt for any number of reasons.

 

The StreamingContainerManager class maintains the current logical/physical 
plans in the “plan” member variable. New methods will be added in 
StreamingContainerManager to save the logical or physical plan as JSON 
representations in the app directory (as described above). The logic is similar 
to com.datatorrent.stram.webapp.StramWebServices.getLogicalPlan(String) and 
com.datatorrent.stram.webapp.StramWebServices.getPhysicalPlan() used inside the 
Stram Web service. There will be running indexes in StreamingContainerManager 
to keep track of MMM for the logical plan and physical plan. The appropriate 
save method will be called on the occurrence of any event that updates the 
logical or physical plan for example:

 

inside 
com.datatorrent.stram.StreamingContainerManager.LogicalPlanChangeRunnable.call()
  for logical plan change event

 

inside 
com.datatorrent.stram.plan.physical.PhysicalPlan.redoPartitions(PMapping, 
String) for physical plan change event (i.e. redoing partitioning)
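
As a minimal sketch (not the actual implementation), the save helper could look 
roughly like the following; the class, field, and method names here are 
hypothetical:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class PlanSnapshotWriter
{
  private int snapshotIndex = 1; // the running MMM index for this app attempt

  void saveLogicalPlanSnapshot(Path appPath, int appAttempt, String planJson) throws IOException
  {
    // e.g. logicalPlan_1_3.json for the 3rd snapshot in the 1st attempt
    Path file = new Path(appPath, String.format("logicalPlan_%d_%d.json", appAttempt, snapshotIndex++));
    FileSystem fs = FileSystem.newInstance(file.toUri(), new Configuration());
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write(planJson.getBytes(StandardCharsets.UTF_8));
    } finally {
      fs.close();
    }
  }
}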

 

Once these files are created, any user or a tool (such as the Apex CLI or the 
DT Gateway) can look up these files for troubleshooting/researching of “dead” 
applications and significant events in their lifetime in terms of logical or 
physical plan changes. Please send me your feedback.

 

Sanjay

 



Re: Adding new log4j appender to Apex core

2016-11-22 Thread Sanjay Pujare
The only way to “enforce” this new appender is to update the archetypes 
(apex-app-archetype and apex-conf-archetype under apex-core/) to use the new 
appender by default. But there does not seem to be a way to enforce this for anyone 
not using the archetypes.

I agree with not relying on ~/.dt in apex-core. 
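
To make approach 2 below concrete, here is a minimal sketch of the "no rename" 
rollover idea (class and method names are hypothetical, not the proposed 
implementation):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class NoRenameRollover
{
  private int index = 1;
  private final Path dir;

  NoRenameRollover(Path dir) { this.dir = dir; }

  // Creates dt.log.1, dt.log.2, ... and repoints the dt.log symlink at the newest file.
  Path nextLogFile() throws IOException
  {
    Path next = dir.resolve("dt.log." + index++);
    Files.createFile(next);
    Path link = dir.resolve("dt.log");
    Files.deleteIfExists(link);
    Files.createSymbolicLink(link, next.getFileName());
    return next;
  }
}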

On 11/22/16, 1:08 PM, "Thomas Weise"  wrote:

The log4j configuration is ultimately owned by the user, so how do you want
to enforce a custom appender?

I don't think that this should rely on anything in ~/.dt either

Thomas

On Tue, Nov 22, 2016 at 10:00 AM, Priyanka Gugale 
wrote:

> Hi,
>
> I am working on APEXCORE-563
> 
> As per this Jira we should put log file name in container/operator events.
> The problem is current RollingFileAppender keeps renaming files from 1 to 
2
> to ... n as files reach maximum allowed file size. Because of constant
> renaming of files we can't put a fixed file name in stram event.
>
> To overcome this I would like to add a new log4j appender to ApexCore.
> There are two ways I can implement this:
> 1. Have Daily rolling file appender. The current file will be recognized
> based on timestamp in file name. Also to control max file size, we need to
> keep rolling files based on size as well.
> 2. Have Rolling File Appender but do not rename files. When max file size
> is reached, create a new file with the next number, e.g. create log file dt.log.2
> after dt.log.1 is full. Also, to recognize the latest file, keep the softlink
> named dt.log pointing to the current log file.
>
> I would prefer to implement approach 2. Please provide your
> comments/feedback if you feel otherwise.
>
> Also to use this new appender we need to use our custom log4j.properties
> file instead of one present in hadoop conf. For that we need to set jvm
> option -Dlog4j.configuration. I am planning to update file dt-site.xml in
> folder ~/.dt  and default properties file available in apex archetype to
> set jvm options as follows:
>  
>dt.attr.CONTAINER_JVM_OPTIONS
>-Dlog4j.configuration=log4j.props
>  
>
> And I will copy log4j.props file in ~/.dt folder as well as
> apex-archetypes.
>
> Lastly, if someone still misses this new log4j properties file or the JVM option
> to set -Dlog4j.configuration, we will not put the log file name in the event
> raised by the container or operator.
>
> Please provide your feedback on this approach.
>
> -Priyanka
>





Re: Proposing an operator for log parsing.

2016-11-17 Thread Sanjay Pujare
+1, I like this feature.

On 11/17/16, 3:26 AM, "Shraddha Jog"  wrote:

Dear community,

We would like to add an operator in Malhar for parsing different types of
logs.
The idea of this operator is to read log data records of known formats such as
Syslog, common log, combined log, extended log, etc. from the upstream in a
DAG, parse/validate them based on the configured format, and emit the
validated POJOs to the downstream.

We are not targeting log formats from a particular library as such, but the
default formats for common log, combined log, extended log and syslog.
Also, if the user has a specific log format, it can be provided in a
property and the operator will parse the log according to the given format.

Properties:
LogFileFormat : Property to define the data format for the log data record
being read at the Input Port. It can be one of the above four default
log formats or a JSON specifying fields and regular expressions. More
details can be found in the document.

Ports :
1. ParsedLog: This port shall emit the parsed/validated POJO object created
based on the log format configured by the user.

2. ErrorPort: This port shall emit the error log data record.

Proposed design can be found here


.

Thanks,
Shraddha
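
For illustration, a custom LogFileFormat value for a non-default format might 
look like the following; the field names and patterns here are made up, and the 
exact JSON schema is in the linked document:

{
  "fields": [
    {"name": "host",      "regex": "\\S+"},
    {"name": "timestamp", "regex": "\\[.*?\\]"},
    {"name": "status",    "regex": "\\d{3}"}
  ]
}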







Re: Visitor API for DAG

2016-11-17 Thread Sanjay Pujare
There is a risk if the user-written code blocks the thread or crashes the 
process. What are the real-life examples of this use case?


On 11/17/16, 9:21 AM, "amol kekre"  wrote:

+1. Opening up the API for users to put in their own code is good. In
general we should enable users to register their code in a lot of scenarios.

Thks
Amol

On Thu, Nov 17, 2016 at 9:06 AM, Tushar Gosavi 
wrote:

> Yes, It could happen after current DAG validation and before the
> application master is launched.
>
> - Tushar.
>
>
> On Thu, Nov 17, 2016 at 8:32 PM, Munagala Ramanath 
> wrote:
> > When would the visits happen ? Just before normal validation ?
> >
> > Ram
> >
> > On Wed, Nov 16, 2016 at 9:50 PM, Tushar Gosavi 
> wrote:
> >
> >> Hi All,
> >>
> >> How about adding visitor like API for DAG in Apex, and an api to
> >> register visitor for the DAG.
> >> Possible use cases are
> >> -  Validator visitor which could validate the dag
> >> -  Visitor to inject properties/attribute in the operator/streams from
> >> some external sources.
> >> -  Platform does not support validation of individual operators.
> >> developer could write a validator visitor which would call validate
> >> function of operator if it implements Validator interface.
> >> - generate output schema based on operator config and input schema,
> >> and set the schema on output stream.
> >>
> >> Sample API :
> >>
> >> dag.registerVisitor(DAGVisitor visitor);
> >>
> >> Call order of visitorFunctions.
> >> - preVisitDAG(Attributes) // dag attributes
> >>   for all operators
> >>   - visitOperator(OperatorMeta meta) // access to operator, name,
> >> attributes, properties
> >>  ports
> >>   - visitStream(StreamMeta meta) // access to
> >> stream/name/attributes/properties/ports
> >> - postVisitDAG()
> >>
> >> Regards,
> >> -Tushar.
> >>
>
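
Reading the call order in the quoted proposal above, the visitor interface 
would presumably look something like this (a sketch; signatures are inferred 
from the proposal, not confirmed):

import com.datatorrent.api.Attribute;
import com.datatorrent.api.DAG;

public interface DAGVisitor
{
  void preVisitDAG(Attribute.AttributeMap dagAttributes);
  void visitOperator(DAG.OperatorMeta operatorMeta);
  void visitStream(DAG.StreamMeta streamMeta);
  void postVisitDAG();
}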





Re: Proposal for apex/malhar extensions

2016-11-16 Thread Sanjay Pujare
+1 for the idea. It will be good to describe the registration mechanism to be used.

On 11/16/16, 3:17 AM, "Chinmay Kolhatkar"  wrote:

Dear Community,

This is in relation to malhar cleanup work that is ongoing.

In one of the talks during Apache BigData Europe, I got to know about
Spark-Packages (https://spark-packages.org/) (I believe lot of you must be
aware of it).
A Spark package is basically functionality over and above, and using, Spark
core functionality. A Spark package can initially be present in someone's
public repository, and one could register it with
https://spark-packages.org/; later on, as it matures and finds more use,
it gets consumed in the mainstream Spark repository and releases.

I found this idea quite interesting to keep our apex-malhar releases
cleaner.

One could have an extension to apex-malhar in their own repository and just
register it with Apache Apex. As it matures and finds more and more use,
we can consume it in mainstream releases.
Advantages to this are multiple:
1. The entry point for registering extensions with Apache Apex can be
minimal. This way we get more indirect contributions.
2. Faster way to add more feature in the project.
3. We keep our releases cleaner.
4. One could progress on feature-set faster balancing both Apache Way as
well as their own Enterprise Interests.

Please share your thoughts on this.

Thanks,
Chinmay.





Re: (APEXMALHAR-2340) Initialize the list of JdbcFieldInfo in JdbcPOJOInsertOutput operator from properties.xml

2016-11-15 Thread Sanjay Pujare
+1 for standardized JSON based mapping/schema definitions

On Mon, Nov 14, 2016 at 11:15 PM, Priyanka Gugale  wrote:

> +1 for having json based input for mappings.
>
> -Priyanka
>
> On Mon, Nov 14, 2016 at 11:21 PM, Devendra Tagare <
> devend...@datatorrent.com
> > wrote:
>
> > Hi,
> >
> > CSV schema formats are based on delimited schemas, which are meant to be
> > sequence-sensitive (ref: DelimitedSchema,
> > contrib/src/main/java/com/datatorrent/contrib/parser/DelimitedSchema.java)
> > and don't have a notion of input to output field mappings.
> >
> > Field info mappings for output operators are typically of the form
> > destFieldName:pojoFieldName:type/supportType and are not intended to be
> > sequence-sensitive.
> >
> > We can go with a JSON based structure which maps sources to destinations
> > with their respective types,
> >
> > {
> >   "destinationFieldName": "destination field name",
> >   "destType" : "support type, type",
> >   "srcFieldName" : "source pojo field name",
> >   "srcType" : "support type, type",
> >   "constraints" : "constraint expression"
> > }
> >
> > Thanks,
> > Dev
> >
> >
> >
> >
> >
> > Thanks,
> > Dev
> >
> > On Mon, Nov 14, 2016 at 9:27 AM, Ashwin Chandra Putta <
> > ashwinchand...@gmail.com> wrote:
> >
> > > Hitesh,
> > >
> > > We should standardize the schema definition across apex for individual
> > > operators and tuple classes.
> > >
> > > I think you should be able to use schema definition for CSV parser
> > without
> > > the delimiter.
> > >
> > > Regards,
> > > Ashwin.
> > >
> > > On Nov 14, 2016 2:45 AM, "Hitesh Kapoor" 
> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Currently in JdbcPOJOInsertOuput operator we cannot configure
> > > JdbcFieldInfo
> > > > via properties.xml and the user has to do the necessary coding in his
> > > > application.
> > > >
> > > > To implement this improvement, the approach mentioned in
> > > > http://docs.datatorrent.com/application_packages/#
> operator-properties
> > > > could
> > > > be followed.
> > > > Now we need to provide the user a format for specifying the value of
> > > > fieldInfo.
> > > >
> > > > Kindly let me know which of the following is the best format to be
> used
> > > for
> > > > this
> > > > 1) CSV string (or any delimited string) with values for data members
> of
> > > > JdbcFieldInfo in a fixed sequence.
> > > > 2) JSON format with appropriate mapping.
> > > > 3) XML format with appropriate name tags and values.
> > > >
> > > > Regards,
> > > > Hitesh
> > > >
> > >
> >
>
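
For illustration, option 2 above (JSON format with appropriate mapping) could be 
wired through properties.xml roughly as follows; the operator name, property 
name, and field values are hypothetical, following Dev's structure above:

<property>
  <name>dt.operator.jdbcOutput.prop.fieldInfosJson</name>
  <value>
    [{"destinationFieldName": "USER_ID", "destType": "INTEGER",
      "srcFieldName": "userId", "srcType": "int", "constraints": ""}]
  </value>
</property>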


Re: [KUDOS] Contributed runner: Apache Apex!

2016-10-17 Thread Sanjay Pujare
I am interested in contributing – not so sure about time, but I’ll try to work it 
out…

On 10/17/16, 10:01 AM, "Thomas Weise"  wrote:

FYI, the first cut of Apex runner for Beam is in a feature branch.

https://github.com/apache/incubator-beam/tree/apex-runner

The next step will be to detail the work needed to make it a top-level
runner and lay it out in JIRA.

We will also be looking for folks that are interested to contribute, so if
you have time and interest, please raise your hand!

Thanks,
Thomas

-- Forwarded message --
From: Kenneth Knowles 
Date: Mon, Oct 17, 2016 at 9:51 AM
Subject: [KUDOS] Contributed runner: Apache Apex!
To: "d...@beam.incubator.apache.org" 


Hi all,

I would to, once again, call attention to a great addition to Beam: a
runner for Apache Apex.

After lots of review and much thoughtful revision, pull request #540 has
been merged to the apex-runner feature branch today. Please do take a look,
and help us put the finishing touches on it to get it ready for the master
branch.

And please also congratulate and thank Thomas Weise for this large
endeavor, Vlad Rozov who helped get the integration tests working, and
Gaurav Gupta who contributed review comments.

Kenn





Re: EmbeddedKafka Server for testing kafka related code

2016-10-13 Thread Sanjay Pujare
Strong +1 for using embedded Kafka. I found embedded servers very useful in 
ActiveMQ and SQS (ElasticMQ) unit tests. My only question/concern is how well 
you can simulate different versions of Kafka.

On 10/13/16, 7:25 AM, "Chinmay Kolhatkar"  wrote:

Dear Community,

While testing kafka related endpoint code for Apex-Calcite integration, I
found an interesting embedded kafka implementation over the internet and I
used it in my code.

The implementation is based on ready to use EmbeddedZookeer and KafkaServer
from mvn dependency kafka_2.11:0.9.0.0:test.

I found this local kafka server very stable for testing purpose. Would it
be a good addition to malhar-kafka module in test folder?

The EmbeddedKafka server is here:

https://github.com/chinmaykolhatkar/apex-malhar/blob/calcite/sql/src/test/java/org/apache/apex/malhar/sql/EmbeddedKafka.java

In above class, I've also added method for publishing and consume from
topic.

Please share your opinion.

Thanks,
Chinmay.
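
For context, a hypothetical unit-test usage of such a helper might read as 
below; the method names are assumptions based on the description above 
(publish/consume helpers), not the actual class API:

EmbeddedKafka kafka = new EmbeddedKafka();
kafka.start();                        // boots EmbeddedZookeeper + KafkaServer
kafka.createTopic("testTopic");
kafka.publish("testTopic", Arrays.asList("hello", "world"));
List<String> received = kafka.consume("testTopic", 5000); // wait up to 5 seconds
kafka.stop();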





Re: can operators emit on a different from the operator itself thread?

2016-10-12 Thread Sanjay Pujare
A JIRA has been created for adding this thread affinity check
https://issues.apache.org/jira/browse/APEXCORE-510 . I have made this
enhancement in a branch
https://github.com/sanjaypujare/apex-core/tree/malhar-510.thread_affinity
and I have been benchmarking the performance with this change. I will be
publishing the results in the above JIRA where we can discuss them and
hopefully agree on merging this change.
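
For reference, the shape of the check is roughly as follows; this is a minimal
sketch, not the code in the branch above, and the class name is hypothetical:

import com.datatorrent.api.Sink;

public class ThreadAffinityCheckingSink<T> implements Sink<T>
{
  private final Sink<T> delegate;
  private final Thread operatorThread = Thread.currentThread(); // constructed on the operator thread

  public ThreadAffinityCheckingSink(Sink<T> delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public void put(T tuple)
  {
    if (Thread.currentThread() != operatorThread) {
      throw new IllegalStateException("emit from foreign thread " + Thread.currentThread().getName());
    }
    delegate.put(tuple);
  }

  @Override
  public int getCount(boolean reset)
  {
    return delegate.getCount(reset);
  }
}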

On Thu, Aug 11, 2016 at 1:41 PM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> You are right, I was subconsciously thinking about the THREAD_LOCAL case
> with a single container and a simple DAG and in that case Vlad’s assumption
> might not be valid, but maybe it is.
>
> On 8/11/16, 11:47 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:
>
> If I understand Vlad correctly, what he is saying is that each operator
> saves currentThread in
> its own setup() and checks it in its own output methods. The threads in
> different operators are
> running potentially on different nodes and/or processes and there will
> be
> no connection between them.
>
> Ram
>
> On Thu, Aug 11, 2016 at 11:41 AM, Sanjay Pujare <
> san...@datatorrent.com>
> wrote:
>
> > Name check is expensive, agreed, but there isn’t anything else
> currently.
> > Ideally the stram engine (considering that it is an engine providing
> > resources like threads etc) should use a ThreadFactory or a
> ThreadGroup to
> > create operator threads so identification and adding functionality is
> > easier.
> >
> > The idea of checking for the same thread between setup() and emit()
> won’t
> > work because the emit() check will have to be in the Sink hierarchy
> and
> > AFAIK a Sink object doesn’t have access to the corresponding
> operator,
> > right? Another more fundamental problem probably is that these
> threads
> > don’t have to match. The emit() for any operator (or rather a Sink
> related
> > to an operator) is ultimately triggered by an emitTuple() on the
> topmost
> > input operator in that path which happens in that input operator’s
> thread
> > which doesn’t have to match the thread calling setup() in the
> downstream
> > operators, right?
> >
> >
> > On 8/11/16, 10:59 AM, "Vlad Rozov" <v.ro...@datatorrent.com> wrote:
> >
> > Name verification is too expensive, it will be sufficient to
> store
> > currentThread during setup() and verify that it is the same
> during
> > emit.
> > Checks should be supported not only for DefaultOutputPort, so we
> may
> > have it implemented in various Sinks.
> >
> > Vlad
> >
> > On 8/11/16 10:21, Sanjay Pujare wrote:
> > > Thinking more about this – all of the “operator” threads are
> created
> > by the Stram engine with appropriate names. So we can put checks in
> the
> > DefaultOutputPort.emit() or in the various implementations of
> Sink.put()
> > that the current-thread is one created by the Stram engine (by
> verifying
> > the name).
> > >
> > > We can even use a special Thread object for operator threads
> so the
> > above detection is easier.
> > >
> > >
> > >
> > > On 8/10/16, 6:11 PM, "Amol Kekre" <a...@datatorrent.com>
> wrote:
> > >
> > >  +1 on debug proposal. Even if tuples lands up within the
> > window, it breaks
> > >  all guarantees. A rerun (after restart from a checkpoint)
> can
> > have tuples
> > >  in different windows from this thread. A separate thread
> simply
> > exposes
> > >  users to unwarranted risks.
> > >
> > >  Thks
> > >  Amol
> > >
> > >
> > >  On Wed, Aug 10, 2016 at 6:05 PM, Vlad Rozov <
> > v.ro...@datatorrent.com> wrote:
> > >
> > >  > Tuples emitted between end and begin windows is only
> one of
> > possible
> > >  > behaviors that emitting tuples on a separate from the
> > operator thread may
> > >  > introduce. It will be good to have both checks in place
> at
> > run-time and if
> > >  > checking for the operator thread for every emitted
> tuple is
> > too e

Re: Test coverage tool

2016-10-05 Thread Sanjay Pujare
I like the idea but would like to know the criteria for flagging a pull
request. Will you compare the overall test coverage percentage or the
per-class coverage percentage?
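
For reference, such a check would typically build on a coverage plugin already 
wired into the build, e.g. a JaCoCo setup like the sketch below; the 
compare-against-last-build logic would live in the CI job, and the version is 
illustrative:

<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <version>0.7.7.201606060606</version>
  <executions>
    <execution>
      <goals>
        <goal>prepare-agent</goal>
        <goal>report</goal>
      </goals>
    </execution>
  </executions>
</plugin>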

On Thu, Oct 6, 2016 at 3:52 AM, Chandni Singh  wrote:

> Hi,
>
> Tim and I spoke sometime back that it will be helpful if we can have a test
> coverage tool which flags a pull request if it reduces the test coverage
> percentage than the previous build.
>
> Recently while making improvements to ManagedState (APEXMALHAR-2223
> ), I came across a
> class 'ManagedTimeStateMultiValue'. This class has no tests. Some part of
> this class is tested under join operator but the majority of it is not
> covered by unit tests.
>
> This is just an example and apologies for highlighting it here but I think
> having a test coverage tool whose results are compared with the last build
> can help us address these issues promptly.
>
> Let me know what everyone thinks and I can look into some plugins that we
> can use.
>
> Thanks,
> Chandni
>


Re: [VOTE] Hadoop upgrade

2016-10-05 Thread Sanjay Pujare
+1 for 2.6


On Wed, Oct 5, 2016 at 11:52 AM, Pradeep Kumbhar 
wrote:

> +1 for 2.6
>
> On Tue, Oct 4, 2016 at 8:42 PM, Munagala Ramanath 
> wrote:
>
> > +1 for 2.6.x
> >
> > Ram
> >
> > On Mon, Oct 3, 2016 at 1:47 PM, David Yan  wrote:
> >
> > > Hi all,
> > >
> > > Thomas created this ticket for upgrading our Hadoop dependency version
> a
> > > couple weeks ago:
> > >
> > > https://issues.apache.org/jira/browse/APEXCORE-536
> > >
> > > We'd like to get the ball rolling and would like to take a vote from
> the
> > > community which version we would like to upgrade to. We have these
> > choices:
> > >
> > > 2.2.0 (no upgrade)
> > > 2.4.x
> > > 2.5.x
> > > 2.6.x
> > >
> > > We are not considering 2.7.x because we already know that many Apex
> users
> > > are using Hadoop distros that are based on 2.6.
> > >
> > > Please note that Apex works with all versions of Hadoop higher or equal
> > to
> > > the Hadoop version Apex depends on, as long as it's 2.x.x. We are not
> > > considering Hadoop 3.0.0-alpha yet at this time.
> > >
> > > When voting, please keep these in mind:
> > >
> > > - The features that are added in 2.4.x, 2.5.x, and 2.6.x respectively,
> > and
> > > how useful those features are for Apache Apex
> > > - The Hadoop versions the major distros (Cloudera, Hortonworks, MapR,
> > EMR,
> > > etc) are supporting
> > > - The Hadoop versions what typical Apex users are using
> > >
> > > Thanks,
> > >
> > > David
> > >
> >
>
>
>
> --
> *regards,*
> *~pradeep*
>


Re: Kudu store operators

2016-10-03 Thread Sanjay Pujare
+1

On Oct 3, 2016 5:33 PM, "Sandeep Deshmukh"  wrote:

> +1
>
> Regards,
> Sandeep
>
> On Mon, Oct 3, 2016 at 10:16 AM, Tushar Gosavi 
> wrote:
>
> > +1, It will be great to have this operator.
> >
> > - Tushar.
> >
> > On Mon, Oct 3, 2016 at 8:15 AM, Chinmay Kolhatkar
> >  wrote:
> > > +1.
> > >
> > > - Chinmay.
> > >
> > > On 3 Oct 2016 7:25 a.m., "Amol Kekre"  wrote:
> > >
> > >> Ananth,
> > >> This would be great to have. +1
> > >>
> > >> Thks
> > >> Amol
> > >>
> > >> On Sun, Oct 2, 2016 at 8:38 AM, Munagala Ramanath <
> r...@datatorrent.com>
> > >> wrote:
> > >>
> > >> > +1
> > >> >
> > >> > Kudu looks impressive from the overview, though it seems to still be
> > >> > maturing.
> > >> >
> > >> > Ram
> > >> >
> > >> >
> > >> > On Sat, Oct 1, 2016 at 11:42 PM, ananth 
> > wrote:
> > >> >
> > >> > > Hello All,
> > >> > >
> > >> > > I was wondering if it would be worthwhile for the community to
> > consider
> > >> > > support for Apache Kudu as a store ( as a contrib operator inside
> > >> Apache
> > >> > > Malhar ) .
> > >> > >
> > >> > > Here are some benefits I see:
> > >> > >
> > >> > > 1. Kudu is just declared 1.0 and has just been declared production
> > >> ready.
> > >> > > 2. Kudu as a store might a good a fit for many architectures in
> the
> > >> > >years to come because of its capabilities to provide mutability
> > of
> > >> > >data ( unlike HDFS ) and optimized storage formats for scans.
> > >> > > 3. It seems to also withstand high-throughput write patterns which
> > >> > >makes it a stable sink for Apex workflows which operate at very
> > high
> > >> > >volumes.
> > >> > >
> > >> > >
> > >> > > Here are some links
> > >> > >
> > >> > >  *  From the recent Strata conference
> > >> > >https://kudu.apache.org/2016/09/26/strata-nyc-kudu-talks.html
> > >> > >  * https://kudu.apache.org/overview.html
> > >> > >
> > >> > > I can implement this operator if the community feels it is worth
> > adding
> > >> > it
> > >> > > to our code base. If so, could someone please assign the JIRA to
> > me. I
> > >> > have
> > >> > > created this JIRA to track this : https://issues.apache.org/jira
> > >> > > /browse/APEXMALHAR-2278
> > >> > >
> > >> > >
> > >> > > Regards,
> > >> > >
> > >> > > Ananth
> > >> > >
> > >> > >
> > >> >
> > >>
> >
>


Re: Testing operators / CI

2016-09-12 Thread Sanjay Pujare
Most of this discussion applies only to processing operators (non-I/O 
operators), right? 

I/O operators have to be tested with their respective endpoints (e.g. the 
ActiveMQ operator with the ActiveMQ broker), and developing a mock and testing 
against it is often not worth the effort. So how do we get better coverage for 
I/O operators?
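
On the test-driver idea that comes up below, a minimal sketch could look like 
this (names are hypothetical; a real driver would supply a mock OperatorContext 
instead of null):

import com.datatorrent.api.Operator;

public class OperatorLifecycleDriver
{
  public static void run(Operator op, Runnable processCalls, int windowCount)
  {
    op.setup(null); // a mock OperatorContext would go here
    for (long windowId = 1; windowId <= windowCount; windowId++) {
      op.beginWindow(windowId);
      processCalls.run(); // caller feeds tuples to input ports / calls emitTuples()
      op.endWindow();
    }
    op.teardown();
  }
}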


On 9/12/16, 6:02 PM, "Thomas Weise"  wrote:

Yes, I suggested that. Looking for volunteers to tackle these things.


On Mon, Sep 12, 2016 at 5:44 PM, Pramod Immaneni 
wrote:

> I agree but I think it will also help if we provide more tools in this
> space like providing an operator test driver that goes through the
> lifecycle methods of an operator and offers configurability and 
variations.
> This driver could be bootstrapped from the unit test. I see the setup,
> beginWindow, process, endWindow and teardown call pattern repeated in many
> unit tests and this can expand to more methods when operator implements
> more interfaces.
>
> Thanks
>
> On Mon, Sep 12, 2016 at 5:26 PM, Thomas Weise  wrote:
>
> > Hi,
> >
> > Recently there was a bit of discussion on how to write tests for
> operators
> > that will result in good coverage and high confidence in the results of
> the
> > CI. Experience from past releases show that those operators with good
> > coverage are less likely to break down (with a user) due to subsequent
> > changes, while those that don't have coverage in the CI (think contrib)
> are
> > likely to suffer breakdown even due to trivial changes that are 
otherwise
> > easily caught.
> >
> > IMO writing good tests is as important as the operator main code (and
> > documentation and examples..). It was also part of the maturity 
framework
> > that Ashwin proposed a while ago (Ashwin, maybe you can also share a few
> > points). I suggest we expand the contribution guidelines to reflect an
> > agreed set of expectations that contributors can follow when submitting
> PRs
> > or even come up with a checklist for submitting PRs:
> >
> > http://apex.apache.org/malhar-contributing.html
> >
> > Here are a few recurring problems and suggestions in no particular
> > order:
> >
> >- Unit tests are for testing small pieces of code in isolation
> ("unit").
> >Running a DAG in embedded mode is not a unit test, it is an
> integration
> >test.
> >- When writing an operator or making changes to fix bugs etc., it is
> >recommended to write or modify the granular test that exercises this
> > change
> >and as little as possible around it. This happens before writing or
> > running
> >an application and can be done in fast iterations inside the IDE
> without
> >extensive test data setup or application assembly.
> >- When an operator consists of multiple other components, then 
testing
> >for those should also be broken down into units. For example, managed
> > state
> >is not tested by testing dedup or join operator (which are special 
use
> >cases), but through separate tests, that exercise the full spectrum
> (or
> > at
> >least close to) of managed state.
> >- So what about serialization, don't I need to create a DAG to test
> it?
> >You only need Kryo to test serialization of an operator. Use the
> > existing
> >utilities or contribute to utilities that are shared between tests.
> >- Don't I need to run a DAG to test the lifecycle of an operator? No,
> >the sequence of calls to an operator's lifecycle methods are
> documented
> > (or
> >how else would I implement an operator to start with). There are
> quite a
> >few tests that "execute" the operator directly. They have access to
> the
> >state and can assert that with a certain process invocation the
> expected
> >changes occur. That is much more difficult when running a DAG.
> >- I have to write a lot of code to do such testing and possibly I 
will
> >forget some calls? Not when following test driven development. IMO
> that
> >mostly happens when tests are written as afterthought and that's a
> > waste of
> >time. I would suggest though to develop a single operator test driver
> > that
> >will ensures all methods are called for basic sanity check.
> >- Integration tests: with proper unit test coverage, the integration
> >test is more like an example of how to use an operator. Nice for
> users,
> >because they can use it as a starting point for writing their own 
app,
> >including the configuration.
> >- I wrote a nice integration test app with 

Planning to add an InputOperator for gRPC and Protobuf

2016-08-15 Thread Sanjay Pujare
I am thinking of adding an input operator to Apex Malhar that allows gRPC-based 
message streams to be consumed by an Apex system.

 

gRPC (http://www.grpc.io/posts/principles) is a recent open source RPC 
framework that started at Google and is becoming popular. It is typically used 
with Protobuf (a serialization framework also developed at Google, see 
https://developers.google.com/protocol-buffers/docs/overview).

 

In this proposal I will create an AbstractGrpcInputOperator that will behave 
somewhat like the Http input operator in the sense that it will generate a 
request to the Grpc service and will process the response to parse the 
individual messages and emit tuples. Of course the operator will have support 
for idempotency and exception handling. We will also try to add support for 
partitionability and dynamic scalability based on their applicability to the 
Grpc input operator. Similarly we will opportunistically add support for Client 
interceptors 
(http://www.grpc.io/grpc-java/javadoc/io/grpc/ClientInterceptor.html) and other 
gRPC usage models (e.g. unary vs streaming).

 

A developer uses the “protoc” compiler and an input “proto” file to generate 
Java classes that define the client “stubs” and serialized message classes that 
correspond to the RPC definition in the proto file. Hence 
AbstractGrpcInputOperator is a generic class requiring the request and response 
type arguments:

 

abstract class AbstractGrpcInputOperator<RequestType, ResponseType>

 

All Protobuf (version 3) generated protocol message classes extend class 
com.google.protobuf.GeneratedMessageV3. This class implements most of the 
Message and Builder interfaces using Java reflection. 

 

The operator also needs an “AbstractStub” instance that is generated by 
“protoc”. AbstractStub is the common base type for client stub implementations. 
It encapsulates things such as remote host+port, TLS vs TCP transport, and 
trust store in case of TLS.

 

The operator also needs a MethodDescriptor object (which encapsulates the name 
of the operation to execute as well as Marshaller instances used to parse and 
serialize request and response messages) and a RequestType object that contains 
the RPC/Request arguments.

 

The operator will create a separate thread to asynchronously post gRPC requests 
in an infinite loop and the same thread will process the response for received 
messages (ResponseType objects). These ResponseType objects will be added to an 
ArrayBlockingQueue and the emitTuple() will read this queue to generate the 
tuples (similar to the logic in AbstractJMSInputOperator of Malhar).
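
A structural sketch of that hand-off follows; signatures and lifecycle details 
are simplified and hypothetical, not the final design:

import java.util.concurrent.ArrayBlockingQueue;
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;

public abstract class AbstractGrpcInputOperator<RequestType, ResponseType> implements InputOperator
{
  public final transient DefaultOutputPort<ResponseType> output = new DefaultOutputPort<>();
  private final transient ArrayBlockingQueue<ResponseType> queue = new ArrayBlockingQueue<>(1024);

  // Called from the gRPC response-processing thread.
  protected void onResponse(ResponseType response) throws InterruptedException
  {
    queue.put(response);
  }

  @Override
  public void emitTuples()
  {
    ResponseType r;
    while ((r = queue.poll()) != null) {
      output.emit(r); // tuples are emitted only on the operator thread
    }
  }

  @Override public void setup(OperatorContext context) { /* start the request thread */ }
  @Override public void teardown() { /* stop the request thread */ }
  @Override public void beginWindow(long windowId) { }
  @Override public void endWindow() { }
}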

 

The class will go in the package org.apache.apex.malhar.lib.io.grpc . User will 
need to subclass this class and provide the actual types for RequestType and 
ResponseType as well as the properties described above.

 

Comments/feedback welcome.

 

Sanjay

 



Re: auto-generated emails

2016-08-12 Thread Sanjay Pujare
I recommend leaving JIRA untouched and only turning off PR comments to the dev@ 
mailing list.

On 8/11/16, 6:16 PM, "Thomas Weise"  wrote:

I'm going to create a ticket to turn off pull request comments to the dev@
mailing list.

In addition to that, we have also PR comments attached to JIRA:

https://reference.apache.org/pmc/github

The JIRA comments are also sent to the mailing list. So we could either
turn off attaching PR comments to JIRA or turn off all JIRA comments to the
mailing list. It may be useful though to have JIRA comments other then PR
comments on the mailing list?



On Tue, Jul 26, 2016 at 1:05 AM, Chinmay Kolhatkar 
wrote:

> Sure Pramod. I'll take a look at it.
>
> On Tue, Jul 26, 2016 at 1:17 PM, Pramod Immaneni 
> wrote:
>
> > Chinmay,
> >
> > Do you want to investigate setting these up.
> >
> > Thanks
> >
> > On Mon, Jul 18, 2016 at 9:15 PM, Chinmay Kolhatkar <
> > chin...@datatorrent.com>
> > wrote:
> >
> > > Pramod,
> > >
> > > My Suggestion is on the same lines as what Yogi suggested, except
> instead
> > > of creating a seperate mailing list, send further mails to folks in 
the
> > > comments.
> > >
> > > -Chinmay.
> > >
> > >
> > > On Tue, Jul 19, 2016 at 3:07 AM, Pramod Immaneni <
> pra...@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Chinmay are you suggesting the first mail only sent to folks
> mentioned
> > > and
> > > > not everyone in the list? I like Yogi's suggestion on having a
> separate
> > > > commits list where all commit emails go but initial emails are still
> > sent
> > > > to dev and dev can mostly focus on discussions.
> > > >
> > > > Anyone, wants to volunteer to investigate these options?
> > > >
> > > > Thanks
> > > >
> > > > On Fri, Jul 15, 2016 at 2:34 AM, Yogi Devendra <
> > > > devendra.vyavah...@gmail.com
> > > > > wrote:
> > > >
> > > > > +1 on reducing automated emails for comments on dev list.
> > > > >
> > > > > How about having separate mailing list such as commits@apex which
> > > would
> > > > > have full archive of comments?
> > > > >
> > > > > New JIRA, PR can be sent to both dev, commits.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > On 13 July 2016 at 10:22, Chinmay Kolhatkar <
> chin...@datatorrent.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Strongly +1 for this.
> > > > > >
> > > > > > For 1, can the mail be sent to someone who is mentioned in the
> PR?
> > So
> > > > for
> > > > > > e.g., if I mention @PramodSSImmaneni , then Pramod will be part
> of
> > PR
> > > > > email
> > > > > > notification all further back and forth communication for that
> PR.
> > > > > >
> > > > > > @Pramod, just using your name as an example.. :)
> > > > > >
> > > > > > - Chinmay.
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 13, 2016 at 3:18 AM, Munagala Ramanath <
> > > > r...@datatorrent.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Ram
> > > > > > >
> > > > > > > On Tue, Jul 12, 2016 at 2:35 PM, Pramod Immaneni <
> > > > > pra...@datatorrent.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I was wondering how everyone felt about the volume of
> > > > auto-generated
> > > > > > > emails
> > > > > > > > on this list. Looks like multiple emails are generated and
> sent
> > > to
> > > > > > > everyone
> > > > > > > > on the list even for relatively smaller actions such as
> > > commenting
> > > > > on a
> > > > > > > > pull request, one from git, another from JIRA etc.
> > > > > > > >
> > > > > > > > Understanding that there is a need for openness, how about
> > > finding
> > > > a
> > > > > > > > balance. Here are some ideas. I do not know if all of these
> are
> > > > > > > technically
> > > > > > > > feasible.
> > > > > > > >
> > > > > > > > 1. An email is sent to all in the list when a new pull
> request
> > is
> > > > > > created
> > > > > > > > or merged but email notifications for back and forth 
comments
> > > > during
> > > > > > the
> > > > > > > > review are only sent to participants in that particular pull
> > > > request.
> > > > > > > > 2. Similar process as above with JIRA. If someone is
> interested
> > > in
> > > > > all
> > > > > > > the
> > > > > > > > updates to JIRA, including those that come from the pull
> > request,
> > > > > they
> > > > > > > can
> > > > > > > > add 

Re: can operators emit on a different from the operator itself thread?

2016-08-12 Thread Sanjay Pujare
You are right, I was subconsciously thinking about the THREAD_LOCAL case with a 
single container and a simple DAG, and in that case Vlad’s assumption might not 
be valid, but maybe it is.

On 8/11/16, 11:47 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:

If I understand Vlad correctly, what he is saying is that each operator
saves currentThread in
its own setup() and checks it in its own output methods. The threads in
different operators are
running potentially on different nodes and/or processes and there will be
no connection between them.

Ram

On Thu, Aug 11, 2016 at 11:41 AM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> Name check is expensive, agreed, but there isn’t anything else currently.
> Ideally the stram engine (considering that it is an engine providing
> resources like threads etc) should use a ThreadFactory or a ThreadGroup to
> create operator threads so identification and adding functionality is
> easier.
>
> The idea of checking for the same thread between setup() and emit() won’t
> work because the emit() check will have to be in the Sink hierarchy and
> AFAIK a Sink object doesn’t have access to the corresponding operator,
> right? Another more fundamental problem probably is that these threads
> don’t have to match. The emit() for any operator (or rather a Sink related
> to an operator) is ultimately triggered by an emitTuple() on the topmost
> input operator in that path which happens in that input operator’s thread
> which doesn’t have to match the thread calling setup() in the downstream
> operators, right?
>
>
> On 8/11/16, 10:59 AM, "Vlad Rozov" <v.ro...@datatorrent.com> wrote:
>
> Name verification is too expensive, it will be sufficient to store
> currentThread during setup() and verify that it is the same during
> emit.
> Checks should be supported not only for DefaultOutputPort, so we may
    > have it implemented in various Sinks.
>
> Vlad
>
> On 8/11/16 10:21, Sanjay Pujare wrote:
> > Thinking more about this – all of the “operator” threads are created
> by the Stram engine with appropriate names. So we can put checks in the
> DefaultOutputPort.emit() or in the various implementations of Sink.put()
> that the current-thread is one created by the Stram engine (by verifying
> the name).
> >
> > We can even use a special Thread object for operator threads so the
> above detection is easier.
> >
> >
> >
> > On 8/10/16, 6:11 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
> >
> >  +1 on debug proposal. Even if tuples lands up within the
> window, it breaks
> >  all guarantees. A rerun (after restart from a checkpoint) can
> have tuples
> >  in different windows from this thread. A separate thread simply
> exposes
> >  users to unwarranted risks.
> >
> >  Thks
> >  Amol
> >
> >
> >  On Wed, Aug 10, 2016 at 6:05 PM, Vlad Rozov <
> v.ro...@datatorrent.com> wrote:
> >
> >  > Tuples emitted between end and begin windows is only one of
> possible
> >  > behaviors that emitting tuples on a separate from the
> operator thread may
> >  > introduce. It will be good to have both checks in place at
> run-time and if
> >  > checking for the operator thread for every emitted tuple is
> too expensive,
> >  > we may have it enabled only in DEBUG or mode with more checks
> in place.
> >  >
> >  > Vlad
> >  >
> >  >
> >  > Sanjay just reminded me of my typo -> I meant between
> end_window and
> >  >> start_window :)
> >  >>
> >  >> Thks
> >  >> Amol
> >  >>
> >  >> On Wed, Aug 10, 2016 at 2:36 PM, Sanjay Pujare <
> san...@datatorrent.com>
> >  >> wrote:
> >  >>
> >  >> If the goal is to do this validation through static analysis
> of operator
> >  >>> code, I guess it is possible but is going to be
> non-trivial. And there
> >  >>> could

Re: can operators emit on a different from the operator itself thread?

2016-08-11 Thread Sanjay Pujare
Name check is expensive, agreed, but there isn’t anything else currently. 
Ideally the stram engine (considering that it is an engine providing resources 
like threads etc) should use a ThreadFactory or a ThreadGroup to create 
operator threads so identification and adding functionality is easier.

The idea of checking for the same thread between setup() and emit() won’t work 
because the emit() check will have to be in the Sink hierarchy and AFAIK a Sink 
object doesn’t have access to the corresponding operator, right? Another more 
fundamental problem probably is that these threads don’t have to match. The 
emit() for any operator (or rather a Sink related to an operator) is ultimately 
triggered by an emitTuple() on the topmost input operator in that path which 
happens in that input operator’s thread which doesn’t have to match the thread 
calling setup() in the downstream operators, right?


On 8/11/16, 10:59 AM, "Vlad Rozov" <v.ro...@datatorrent.com> wrote:

Name verification is too expensive, it will be sufficient to store 
currentThread during setup() and verify that it is the same during emit. 
Checks should be supported not only for DefaultOutputPort, so we may 
have it implemented in various Sinks.

Vlad

On 8/11/16 10:21, Sanjay Pujare wrote:
> Thinking more about this – all of the “operator” threads are created by 
the Stram engine with appropriate names. So we can put checks in the 
DefaultOutputPort.emit() or in the various implementations of Sink.put() that 
the current-thread is one created by the Stram engine (by verifying the name).
>
> We can even use a special Thread object for operator threads so the above 
detection is easier.
>
>
>
> On 8/10/16, 6:11 PM, "Amol Kekre" <a...@datatorrent.com> wrote:
>
>  +1 on debug proposal. Even if tuples lands up within the window, it 
breaks
>  all guarantees. A rerun (after restart from a checkpoint) can have 
tuples
>  in different windows from this thread. A separate thread simply 
exposes
>  users to unwarranted risks.
>  
>  Thks
>  Amol
>  
>  
>  On Wed, Aug 10, 2016 at 6:05 PM, Vlad Rozov 
<v.ro...@datatorrent.com> wrote:
>  
>  > Tuples emitted between end and begin windows is only one of 
possible
>  > behaviors that emitting tuples on a separate from the operator 
thread may
>  > introduce. It will be good to have both checks in place at 
run-time and if
>  > checking for the operator thread for every emitted tuple is too 
expensive,
>  > we may have it enabled only in DEBUG or mode with more checks in 
place.
>  >
>  > Vlad
>  >
>  >
>  > Sanjay just reminded me of my typo -> I meant between end_window 
and
>  >> start_window :)
>  >>
>  >> Thks
>  >> Amol
>  >>
>  >> On Wed, Aug 10, 2016 at 2:36 PM, Sanjay Pujare 
<san...@datatorrent.com>
>  >> wrote:
>  >>
>  >> If the goal is to do this validation through static analysis of 
operator
>  >>> code, I guess it is possible but is going to be non-trivial. And 
there
>  >>> could be false positives and false negatives.
>  >>>
>  >>> Also I suppose this discussion applies to processor operators 
(those
>  >>> having both in and out ports) so Ram’s example of 
JdbcPollInputOperator
>  >>> may
>  >>> not be applicable here?
>  >>>
>  >>> On 8/10/16, 2:04 PM, "Ashwin Chandra Putta" 
<ashwinchand...@gmail.com>
>  >>> wrote:
>  >>>
>  >>>  In a separate thread I mean.
>  >>>
>  >>>  Regards,
>  >>>  Ashwin.
>  >>>
>  >>>  On Wed, Aug 10, 2016 at 2:01 PM, Ashwin Chandra Putta <
>  >>>  ashwinchand...@gmail.com> wrote:
>  >>>
>  >>>  > + dev@apex.apache.org
>  >>>  > - us...@apex.apache.org
>  >>>  >
>  >>>  > This is one of those best practices that we learn by 
experience
>  >>> during
>  >>>  > operator development. It will save a lot of time during 
operator
>  >>>  > development if we can catch and throw validation error 
when
>  >>> so

Re: can operators emit on a different from the operator itself thread?

2016-08-11 Thread Sanjay Pujare
Thinking more about this – all of the “operator” threads are created by the 
Stram engine with appropriate names. So we can put checks in the 
DefaultOutputPort.emit() or in the various implementations of Sink.put() that 
the current-thread is one created by the Stram engine (by verifying the name).

We can even use a special Thread object for operator threads so the above 
detection is easier.



On 8/10/16, 6:11 PM, "Amol Kekre" <a...@datatorrent.com> wrote:

+1 on debug proposal. Even if tuples lands up within the window, it breaks
all guarantees. A rerun (after restart from a checkpoint) can have tuples
in different windows from this thread. A separate thread simply exposes
users to unwarranted risks.

Thks
Amol


On Wed, Aug 10, 2016 at 6:05 PM, Vlad Rozov <v.ro...@datatorrent.com> wrote:

> Tuples emitted between end and begin windows is only one of possible
> behaviors that emitting tuples on a separate from the operator thread may
> introduce. It will be good to have both checks in place at run-time and if
> checking for the operator thread for every emitted tuple is too expensive,
> we may have it enabled only in DEBUG or mode with more checks in place.
>
> Vlad
>
>
> Sanjay just reminded me of my typo -> I meant between end_window and
>> start_window :)
>>
>> Thks
>> Amol
>>
>> On Wed, Aug 10, 2016 at 2:36 PM, Sanjay Pujare <san...@datatorrent.com>
>> wrote:
>>
>> If the goal is to do this validation through static analysis of operator
>>> code, I guess it is possible but is going to be non-trivial. And there
>>> could be false positives and false negatives.
>>>
>>> Also I suppose this discussion applies to processor operators (those
>>> having both in and out ports) so Ram’s example of JdbcPollInputOperator
>>> may
>>> not be applicable here?
>>>
>>> On 8/10/16, 2:04 PM, "Ashwin Chandra Putta" <ashwinchand...@gmail.com>
>>> wrote:
>>>
>>>  In a separate thread I mean.
>>>
>>>  Regards,
>>>  Ashwin.
>>>
>>>  On Wed, Aug 10, 2016 at 2:01 PM, Ashwin Chandra Putta <
>>>  ashwinchand...@gmail.com> wrote:
>>>
>>>  > + dev@apex.apache.org
>>>  > - us...@apex.apache.org
>>>  >
>>>  > This is one of those best practices that we learn by experience
>>> during
>>>  > operator development. It will save a lot of time during operator
>>>  > development if we can catch and throw validation error when
>>> someone
>>> emits
>>>  > tuples in a non separate thread.
>>>  >
>>>  > Regards,
>>>  > Ashwin
>>>  >
>>>  > On Wed, Aug 10, 2016 at 1:57 PM, Munagala Ramanath <
>>> r...@datatorrent.com>
>>>  > wrote:
>>>  >
>>>  >> For cases where use of a different thread is needed, it can 
write
>>> tuples
>>>  >> to a queue from where the operator thread pulls them --
>>>  >> JdbcPollInputOperator in Malhar has an example.
>>>  >>
>>>  >> Ram
>>>  >>
>>>  >> On Wed, Aug 10, 2016 at 1:50 PM, hsy...@gmail.com <
>>> hsy...@gmail.com
>>>  >> wrote:
>>>  >>
>>>  >>> Hey Vlad,
>>>  >>>
>>>  >>> Thanks for bringing this up. Is there an easy way to detect
>>> unexpected
>>>  >>> use of emit method without hurt the performance. Or at least if
>>> we
>>> can
>>>  >>> detect this in debug mode.
>>>  >>>
>>>  >>> Regards,
>>>  >>> Siyuan
>>>  >>>
>>>  >>> On Wed, Aug 10, 2016 at 11:27 AM, Vlad Rozov <
>>> v.ro...@datatorrent.com>
>>>  >>> wrote:
>>>  >>>
>>>  >>>> The short answer is no, creating worker thread to emit tuples
>>> is
>>> not
>>>  >>>> supported by Apex and will lead to an undefined behavior.
>>> Operators in Apex
>>>  >>>> have strong thread affinity and all interaction with the
>>> platform
>>> must
>>>  >>>> happen on the operator thread.
>>>  >>>>
>>>  >>>> Vlad
>>>  >>>>
>>>  >>>
>>>  >>>
>>>  >>
>>>  >
>>>  >
>>>  > --
>>>  >
>>>  > Regards,
>>>  > Ashwin.
>>>  >
>>>
>>>
>>>
>>>  --
>>>
>>>  Regards,
>>>  Ashwin.
>>>
>>>
>>>
>>>
>>>
>





Re: can operators emit on a different from the operator itself thread?

2016-08-10 Thread Sanjay Pujare
If the goal is to do this validation through static analysis of operator code, 
I guess it is possible, but it is going to be non-trivial. And there could be 
false positives and false negatives.

Also I suppose this discussion applies to processor operators (those having 
both in and out ports) so Ram’s example of JdbcPollInputOperator may not be 
applicable here?
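
For completeness, the safe pattern (the queue hand-off Ram mentions in the 
quoted thread below) in minimal form, with illustrative names:

// Inside the operator: the worker thread only enqueues; emission happens on the
// operator thread inside emitTuples().
private final transient ArrayBlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

@Override
public void emitTuples()
{
  String tuple;
  while ((tuple = buffer.poll()) != null) {
    output.emit(tuple);
  }
}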

On 8/10/16, 2:04 PM, "Ashwin Chandra Putta"  wrote:

In a separate thread I mean.

Regards,
Ashwin.

On Wed, Aug 10, 2016 at 2:01 PM, Ashwin Chandra Putta <
ashwinchand...@gmail.com> wrote:

> + dev@apex.apache.org
> - us...@apex.apache.org
>
> This is one of those best practices that we learn by experience during
> operator development. It will save a lot of time during operator
> development if we can catch and throw validation error when someone emits
> tuples in a non separate thread.
>
> Regards,
> Ashwin
>
> On Wed, Aug 10, 2016 at 1:57 PM, Munagala Ramanath 
> wrote:
>
>> For cases where use of a different thread is needed, it can write tuples
>> to a queue from where the operator thread pulls them --
>> JdbcPollInputOperator in Malhar has an example.
>>
>> Ram
>>
>> On Wed, Aug 10, 2016 at 1:50 PM, hsy...@gmail.com 
>> wrote:
>>
>>> Hey Vlad,
>>>
>>> Thanks for bringing this up. Is there an easy way to detect unexpected
>>> use of emit method without hurt the performance. Or at least if we can
>>> detect this in debug mode.
>>>
>>> Regards,
>>> Siyuan
>>>
>>> On Wed, Aug 10, 2016 at 11:27 AM, Vlad Rozov 
>>> wrote:
>>>
 The short answer is no, creating worker thread to emit tuples is not
 supported by Apex and will lead to an undefined behavior. Operators in 
Apex
 have strong thread affinity and all interaction with the platform must
 happen on the operator thread.

 Vlad

>>>
>>>
>>
>
>
> --
>
> Regards,
> Ashwin.
>



-- 

Regards,
Ashwin.





Re: empty operator/stream/module names

2016-08-04 Thread Sanjay Pujare
That’s a good point. System-generated names can still be made to work for this 
use case, but I see the reason for having a name.

But then another set of questions comes up: we need to validate the name for 
uniqueness within an app, valid syntax, etc. Maybe it’s already being done.
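
For example, a minimal validation along those lines (hypothetical; the syntax 
rule here is just a placeholder):

import java.util.Set;

static void validateName(String name, Set<String> namesInApp)
{
  if (name == null || name.isEmpty()) {
    throw new IllegalArgumentException("operator/stream/module name must be non-empty");
  }
  if (!name.matches("[A-Za-z0-9_\\-\\. ]+")) { // placeholder syntax rule
    throw new IllegalArgumentException("invalid characters in name: " + name);
  }
  if (!namesInApp.add(name)) {
    throw new IllegalArgumentException("duplicate name within application: " + name);
  }
}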

On 8/4/16, 10:36 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:

It will not be possible to configure such operators from an XML file other
than through
wildcards -- but maybe that's OK.

Ram

On Thu, Aug 4, 2016 at 10:03 AM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> I differ. For the UI to render a DAG the names are useful, but if the name
> is not required by the engine i.e. the engine is able to execute your
> application fine with empty or null strings as names, is there any reason
> to make them mandatory?
>
> On the other hand, we can come up with a scheme for system generated names
> when the caller doesn’t provide a name. I have some ideas.
>
>
> On 8/4/16, 9:48 AM, "Munagala Ramanath" <r...@datatorrent.com> wrote:
>
> I don't see any reason to allow either.
>
> Ram
>
> On Thu, Aug 4, 2016 at 8:51 AM, Vlad Rozov <v.ro...@datatorrent.com>
> wrote:
>
> > Currently addOperator/addStream/addModule allows both null and empty
> > string in the operator/stream/module names. Is there any reason to
> allow
> > empty string? Should empty string and null be disallowed in those
> APIs?
> >
> > Vlad
> >
>
>
>
>





Re: empty operator/stream/module names

2016-08-04 Thread Sanjay Pujare
I differ. For the UI to render a DAG the names are useful, but if the name is 
not required by the engine, i.e. the engine is able to execute your application 
fine with empty or null strings as names, is there any reason to make them 
mandatory? 

On the other hand, we can come up with a scheme for system-generated names when 
the caller doesn’t provide a name. I have some ideas.


On 8/4/16, 9:48 AM, "Munagala Ramanath"  wrote:

I don't see any reason to allow either.

Ram

On Thu, Aug 4, 2016 at 8:51 AM, Vlad Rozov  wrote:

> Currently addOperator/addStream/addModule allows both null and empty
> string in the operator/stream/module names. Is there any reason to allow
> empty string? Should empty string and null be disallowed in those APIs?
>
> Vlad
>





Re: custom JAVA_HOME

2016-08-03 Thread Sanjay Pujare
+1 for Pramod’s idea of allowing all variables supported by YARN

On 8/3/16, 12:56 AM, "Chinmay Kolhatkar"  wrote:

+1 for the idea.
Are there any in the list that one can see as conflicting with our own
environment variables? For e.g. LOGNAME?


On Wed, Aug 3, 2016 at 4:49 AM, Pramod Immaneni 
wrote:

> How about allowing specification of all environment variables supported by
> YARN that are non-final described below
>
>
> 
http://atetric.com/atetric/javadoc/org.apache.hadoop/hadoop-yarn-api/0.23.3/org/apache/hadoop/yarn/api/ApplicationConstants.Environment.html
>
> Thanks
>
> On Tue, Aug 2, 2016 at 3:43 PM, Vlad Rozov 
> wrote:
>
> > Should Apex add JAVA_HOME to DAGContext and allow application to specify
> > which JDK to use if there are multiple JDK installations on Hadoop
> cluster?
> > Yarn already supports custom JAVA_HOME (please see
> > https://issues.apache.org/jira/browse/YARN-2481).
> >
> > Vlad
> >
>





Re: JMS Input Operator enhancements

2016-07-14 Thread Sanjay Pujare
Yes. ActiveMQ is already there and uses the Apache license, so nothing to do 
there. The libs for AWS SQS also use the Apache license.
On 7/14/16, 2:42 PM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:

What is the license covering the libraries, Apache?

On Thu, Jul 14, 2016 at 2:37 PM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> Sounds good. Will go with test scope.
>
> On 7/14/16, 12:57 PM, "Ashwin Chandra Putta" <ashwinchand...@gmail.com>
> wrote:
>
> They can be made test scoped in malhar for testing. The user of this
> operator can add the implementation specific dependency in their
> application pom where they use this operator and supply the
> corresponding
> connection factory class name.
>
> Regards,
    > Ashwin.
>
> On Thu, Jul 14, 2016 at 11:55 AM, Sanjay Pujare <
> san...@datatorrent.com>
> wrote:
>
> > Pramod,
> >
> > The JMS “wrapper” library needs the actual client library as you can
> see
> > in com.datatorrent.lib.io.jms.JMSBase.getConnectionFactory()
> >
> > Class<ConnectionFactory> clazz =
> > (Class<ConnectionFactory>)Class.forName(connectionFactoryClass);
> >
> > Where it loads the client library’s connectionFactoryClass by name.
> > Instead of putting the onus on the caller, I thought we can include
> these
> > libraries, so the current class loader can load the required class.
> >
> > Sanjay
> >
> >
> > On 7/14/16, 11:36 AM, "Pramod Immaneni" <pra...@datatorrent.com>
> wrote:
> >
> > Hi Sanjay,
> >
> > If there are going to be separate operators developed for SQS and ActiveMQ
> > using the native API, why do we need to add the client libraries for those
> > to JMS? Are these only needed for testing?
> >
> > Thanks
> >
> > On Thu, Jul 14, 2016 at 11:16 AM, Sanjay Pujare <
> > san...@datatorrent.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I am proposing the following enhancements to the Malhar JMS input
> > > operator. Let me know if you have any comments.
> > >
> > > Enhancements are proposed to the malhar JMS Input Operator (in the
> > > package com.datatorrent.lib.io.jms) to make it usable with any JMS
> > > compatible message broker for basic input functionality.
> > >
> > > For now we will make this input operator work with ActiveMQ and
> > > Amazon SQS through the JMS interface. The current code contains
> > > implementation and tests for ActiveMQ, and we will add support and
> > > tests for SQS without breaking the ActiveMQ support.
> > >
> > > The motivation for this enhancement is as follows. JMS is NOT a wire
> > > protocol but just an API specification for Java programs, so JMS is
> > > just a “wrapper” around an actual implementation library that talks
> > > to the respective message broker. Moreover each message broker has
> > > its own semantics for more advanced features like partitioning or
> > > dynamic scaling that are impossible or difficult to capture via the
> > > JMS abstraction. With this enhancement, we expect the JMS input
> > > operator to be usable as a generic “JMS input operator” with no
> > > support for advanced features like partitionability or dynamic
> > > scalability.
> > >
> > > Full featured input operators for SQS and ActiveMQ will be developed
> > > separately (without necessarily using the JMS interface but the most
> > > appropriate interface) and the design and implementation of those
> > > operators are out of scope here.
> > >
> > > The enh

Re: JMS Input Operator enhancements

2016-07-14 Thread Sanjay Pujare
Sounds good. Will go with test scope.

On 7/14/16, 12:57 PM, "Ashwin Chandra Putta" <ashwinchand...@gmail.com> wrote:

They can be made test scoped in malhar for testing. The user of this
operator can add the implementation specific dependency in their
application pom where they use this operator and supply the corresponding
connection factory class name.

Regards,
Ashwin.

On Thu, Jul 14, 2016 at 11:55 AM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> Pramod,
>
> The JMS “wrapper” library needs the actual client library as you can see
> in com.datatorrent.lib.io.jms.JMSBase.getConnectionFactory()
>
> Class<ConnectionFactory> clazz =
> (Class<ConnectionFactory>)Class.forName(connectionFactoryClass);
>
> Where it loads the client library’s connectionFactoryClass by name.
> Instead of putting the onus on the caller, I thought we could include these
> libraries, so the current class loader can load the required class.
>
> Sanjay
>
>
> On 7/14/16, 11:36 AM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:
>
> Hi Sanjay,
>
> If there are going to be separate operators developed for SQS and ActiveMQ
> using the native API, why do we need to add the client libraries for those
> to JMS? Are these only needed for testing?
>
> Thanks
>
> On Thu, Jul 14, 2016 at 11:16 AM, Sanjay Pujare <
> san...@datatorrent.com>
> wrote:
>
> > Hi all,
> >
> > I am proposing the following enhancements to the Malhar JMS input
> > operator. Let me know if you have any comments.
> >
> > Enhancements are proposed to the malhar JMS Input Operator (in the
> > package com.datatorrent.lib.io.jms) to make it usable with any JMS
> > compatible message broker for basic input functionality.
> >
> > For now we will make this input operator work with ActiveMQ and Amazon
> > SQS through the JMS interface. The current code contains implementation
> > and tests for ActiveMQ, and we will add support and tests for SQS
> > without breaking the ActiveMQ support.
> >
> > The motivation for this enhancement is as follows. JMS is NOT a wire
> > protocol but just an API specification for Java programs, so JMS is
> > just a “wrapper” around an actual implementation library that talks to
> > the respective message broker. Moreover each message broker has its own
> > semantics for more advanced features like partitioning or dynamic
> > scaling that are impossible or difficult to capture via the JMS
> > abstraction. With this enhancement, we expect the JMS input operator to
> > be usable as a generic “JMS input operator” with no support for
> > advanced features like partitionability or dynamic scalability.
> >
> > Full featured input operators for SQS and ActiveMQ will be developed
> > separately (without necessarily using the JMS interface but the most
> > appropriate interface) and the design and implementation of those
> > operators are out of scope here.
> >
> > The enhancement includes:
> >
> >    - adding amazon-sqs-java-messaging-lib and aws-android-sdk-sqs
> >      libraries to the pom
> >    - adding test cases for SQS, matching those for ActiveMQ
> >    - making both sets of test cases work
> >    - cleaning up the existing code (e.g. removing hard coded values
> >      like “TEST.FOO” in JMSBase)
> >
> > SQS Considerations
> >
> > No Support for Topics
> >
> > SQS only supports queues, so topics will not be supported for the SQS
> > variant.
> >
> > Idempotency
> >
> > We will implement a simple idempotent design that uses the
> > WindowDataManager to store the whole message and deletes the message
> > from SQS so we are not impacted by SQS’s visibility timeout feature.
> >
> >
> > Sanjay
> >
>
>
>
>


-- 

Regards,
Ashwin.





Re: JMS Input Operator enhancements

2016-07-14 Thread Sanjay Pujare
Pramod,

The JMS “wrapper” library needs the actual client library as you can see in 
com.datatorrent.lib.io.jms.JMSBase.getConnectionFactory()

Class<ConnectionFactory> clazz = 
(Class<ConnectionFactory>)Class.forName(connectionFactoryClass);

Where it loads the client library’s connectionFactoryClass by name. Instead of 
putting the onus on the caller, I thought we could include these libraries, so 
the current class loader can load the required class.
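
To make the pattern concrete, here is a minimal sketch of that reflective
lookup (error handling and property configuration omitted; the real JMSBase
code may differ in details):

    import javax.jms.ConnectionFactory;

    // Sketch of the reflective factory lookup described above. The named
    // class must be on the classpath at runtime, which is why bundling
    // (or test-scoping) the client libraries matters.
    class FactoryLoader
    {
      ConnectionFactory load(String connectionFactoryClass) throws Exception
      {
        Class<? extends ConnectionFactory> clazz =
            Class.forName(connectionFactoryClass).asSubclass(ConnectionFactory.class);
        return clazz.newInstance();
      }
    }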

Sanjay


On 7/14/16, 11:36 AM, "Pramod Immaneni" <pra...@datatorrent.com> wrote:

Hi Sanjay,

If there are going to be separate operators developed for SQS and ActiveMQ
using the native API, why do we need to add the client libraries for those
to JMS? Are these only needed for testing?

Thanks

On Thu, Jul 14, 2016 at 11:16 AM, Sanjay Pujare <san...@datatorrent.com>
wrote:

> Hi all,
>
> I am proposing the following enhancements to the Malhar JMS input
> operator. Let me know if you have any comments.
>
> Enhancements are proposed to the malhar JMS Input Operator (in the package
> com.datatorrent.lib.io.jms) to make it usable with any JMS compatible
> message broker for basic input functionality.
>
> For now we will make this input operator work with ActiveMQ and Amazon SQS
> through the JMS interface. The current code contains implementation and
> tests for ActiveMQ, and we will add support and tests for SQS without
> breaking the ActiveMQ support.
>
> The motivation for this enhancement is as follows. JMS is NOT a wire
> protocol but just an API specification for Java programs, so JMS is just a
> “wrapper” around an actual implementation library that talks to the
> respective message broker. Moreover each message broker has its own
> semantics for more advanced features like partitioning or dynamic scaling
> that are impossible or difficult to capture via the JMS abstraction. With
> this enhancement, we expect the JMS input operator to be usable as a
> generic “JMS input operator” with no support for advanced features like
> partitionability or dynamic scalability.
>
> Full featured input operators for SQS and ActiveMQ will be developed
> separately (without necessarily using the JMS interface but the most
> appropriate interface) and the design and implementation of those
> operators are out of scope here.
>
> The enhancement includes:
>
>    - adding amazon-sqs-java-messaging-lib and aws-android-sdk-sqs
>      libraries to the pom
>    - adding test cases for SQS, matching those for ActiveMQ
>    - making both sets of test cases work
>    - cleaning up the existing code (e.g. removing hard coded values like
>      “TEST.FOO” in JMSBase)
>
> SQS Considerations
>
> No Support for Topics
>
> SQS only supports queues, so topics will not be supported for the SQS
> variant.
>
> Idempotency
>
> We will implement a simple idempotent design that uses the
> WindowDataManager to store the whole message and deletes the message from
> SQS so we are not impacted by SQS’s visibility timeout feature.
>
>
> Sanjay
>





JMS Input Operator enhancements

2016-07-14 Thread Sanjay Pujare
Hi all,

I am proposing the following enhancements to the Malhar JMS input operator.
Let me know if you have any comments.

Enhancements are proposed to the malhar JMS Input Operator (in the package
com.datatorrent.lib.io.jms) to make it usable with any JMS compatible
message broker for basic input functionality.

For now we will make this input operator work with ActiveMQ and Amazon SQS
through the JMS interface. The current code contains implementation and
tests for ActiveMQ, and we will add support and tests for SQS without
breaking the ActiveMQ support.

The motivation for this enhancement is as follows. JMS is NOT a wire
protocol but just an API specification for Java programs, so JMS is just a
“wrapper” around an actual implementation library that talks to the
respective message broker. Moreover each message broker has its own
semantics for more advanced features like partitioning or dynamic scaling
that are impossible or difficult to capture via the JMS abstraction. With
this enhancement, we expect the JMS input operator to be usable as a
generic “JMS input operator” with no support for advanced features like
partitionability or dynamic scalability.

Full featured input operators for SQS and ActiveMQ will be developed
separately (without necessarily using the JMS interface but the most
appropriate interface) and the design and implementation of those operators
are out of scope here.

The enhancement includes:

   - adding amazon-sqs-java-messaging-lib and aws-android-sdk-sqs libraries
     to the pom
   - adding test cases for SQS, matching those for ActiveMQ
   - making both sets of test cases work
   - cleaning up the existing code (e.g. removing hard coded values like
     “TEST.FOO” in JMSBase)

SQS Considerations

No Support for Topics

SQS only supports queues, so topics will not be supported for the SQS
variant.
Idempotency

We will implement a simple idempotent design that uses the
WindowDataManager to store the whole message and deletes the message from
SQS so we are not impacted by SQS’s visibility timeout feature.
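
Roughly, the per-window flow would look like the sketch below. This is an
illustration only, not the final operator code: the names are placeholders,
and the WindowDataManager call is stubbed out because its exact signature
varies across Malhar versions.

    import java.util.ArrayList;
    import java.util.List;

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.model.Message;

    // Sketch only: persist the window's messages first, then delete them
    // from SQS so the visibility timeout can no longer resurface them on
    // replay.
    class SqsWindowSketch
    {
      private AmazonSQS sqs;        // assumed configured elsewhere
      private String queueUrl;
      private final List<Message> windowMessages = new ArrayList<>();

      void endWindow(long windowId) throws Exception
      {
        saveForWindow(windowMessages, windowId); // stand-in for WindowDataManager.save(...)
        for (Message m : windowMessages) {
          sqs.deleteMessage(queueUrl, m.getReceiptHandle());
        }
        windowMessages.clear();
      }

      private void saveForWindow(Object data, long windowId) throws Exception
      {
        // placeholder: persist via WindowDataManager; exact API varies by version
      }
    }

On replay of a window the stored messages would be re-emitted from the saved
state instead of being fetched from SQS again.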



Sanjay


Re: Bleeding edge branch ?

2016-07-11 Thread Sanjay Pujare
As the name suggests, the "bleeding-edge" branch should ideally use
bleeding-edge versions, so I would like to see Java 8 used there (and Hadoop 3
when it eventually comes out) to make the maintenance effort worthwhile...

On Mon, Jul 11, 2016 at 12:05 PM, David Yan  wrote:

> I'm -0 on Java 8, but I'm +1 on the rest, and an especially strong +1 for
> upgrading the Hadoop dependency version.
>
> Here are my reasons:
>
> - Hadoop 3 will require Java 8, but Hadoop 2.7.2 still supports Java 7 and
> there will probably be some time (I'm guessing more than one year) for
> Hadoop 3 to become GA and for major distros to support Hadoop 3. The
> maintenance effort for having two branches, one for Java 7 and one for
> Java 8, is not worth it at this time.
>
> - Apex currently uses Hadoop 2.2 dependencies, marked "provided". Hadoop
> 2.4 was released more than two years ago, and it added a lot of
> features in the API that Apex can make use of. Most distros already bundle
> Hadoop 2.6 or later. Although some old versions of Cloudera that include
> Hadoop versions earlier than 2.4 have not reached end-of-life yet, the
> number of users on those old versions is probably very small.
>
> David
>
>
> On Mon, Jul 11, 2016 at 8:59 AM, Munagala Ramanath 
> wrote:
>
> > We've had a number of issues recently related to dependencies on old
> > versions of various packages/libraries such as Hadoop itself, Google
> > guava, HTTPClient, mbassador, etc.
> >
> > How about we create a "bleeding-edge" branch in both Core and Malhar
> > which will use the latest versions of these various dependencies,
> > upgrade to Java 8 so we can use the new Java features, etc.?
> >
> > This will give us an opportunity to discover these sorts of problems
> > early and, when we are ready to pull the trigger for a major version,
> > we have a branch ready for merge with, hopefully, minimal additional
> > effort.
> >
> > There will be no guarantees w.r.t. this branch, so people using it do so
> > at their own risk.
> >
> > Ram
> >
>


Re: Malhar contribution guidelines

2016-07-05 Thread Sanjay Pujare
Hi Pramod

Very useful document. Some questions/comments from me (sorry for any newbie
comments):

- for the "Update an Existing Operator" section, are there any backward
compatibility constraints one should be aware of?
- it would be very useful to have each guideline illustrated by an actual
example (e.g. an example of combining 2 or more operators in a module,
initialization and teardown cases in
constructor/setup/beginWindow/endWindow/deactivate/teardown, etc.)
- should the guidelines say something about unit tests and how unit tests
should typically be written for operators (also include other kinds of
automated tests in case "unit" testing is difficult)? See the sketch after
this list.
- didn't find any references to WindowedOperator anywhere (but did find
WindowedStream). It would be good to have hyperlinks for all references
- Java 1.8/1.7 compatibility requirement?
- with respect to "...only the data for the first input port...", how is
the "first" input port determined?





On Tue, Jul 5, 2016 at 11:23 AM, Pramod Immaneni 
wrote:

> I received some feedback. Any other comments before adding these guidelines
> to the project?
>
> Thanks
>
> On Fri, Jun 17, 2016 at 3:12 PM, Pramod Immaneni 
> wrote:
>
> > Hi everyone,
> >
> > I wanted to create a set of guidelines that would help folks that want to
> > contribute to Malhar. The goal is that by following these guidelines,
> > contributions will be assured a certain level of quality: the different
> > aspects to consider, and common missteps and mistakes, will be taken care
> > of, which in turn should make the review process smoother by reducing the
> > number of review iterations before a contribution gets merged. I tried to
> > capture in a document as much information as I thought would help
> > developers towards this goal, based on past experience and exposure.
> >
> > Please go through it and provide your feedback; it will be greatly
> > appreciated. I will go through your comments and incorporate any
> > necessary changes. After that I hope this document will become a living
> > document as part of the contribution guidelines and evolve with the
> > times.
> >
> >
> >
> > https://drive.google.com/open?id=1WjbaIogVtMDQwbvxTlQxrFqUbK9D75bk98asW9e5OxA
> >
> > Thanks
> >
>