[jira] [Commented] (APEXCORE-596) Committed method on operators not called when stream locality is THREAD_LOCAL

2017-01-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825574#comment-15825574
 ] 

ASF GitHub Bot commented on APEXCORE-596:
-

Github user francisf closed the pull request at:

https://github.com/apache/apex-core/pull/442


> Committed method on operators not called when stream locality is THREAD_LOCAL
> -
>
> Key: APEXCORE-596
> URL: https://issues.apache.org/jira/browse/APEXCORE-596
> Project: Apache Apex Core
>  Issue Type: Bug
>Affects Versions: 3.5.0
>Reporter: Francis Fernandes
>Assignee: Francis Fernandes
>
> When the locality of the stream connecting two operators is 
> Locality.THREAD_LOCAL, the committed method is not called for some operators 
> that implement Operator.CheckpointListener, e.g. AbstractFileOutputOperator.
> For THREAD_LOCAL streams, we do not set the thread in the node's context 
> during activate.
> Because the thread is not set, the operator is skipped in 
> processHeartBeatResponse and committed is never called:
> {code}
> if (thread == null || !thread.isAlive()) {
>   continue;
> }
> {code}
> We need this condition for invalid operators (operator failures) in case of 
> other localities. 
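
A rough sketch of the idea behind the fix in PR #442 follows; the names (NodeContext, OioGroup) are illustrative only and not the actual apex-core code.

{code}
import java.util.List;

// Each node keeps a reference to the thread that runs it; the heartbeat
// handler uses it to decide whether the operator is alive.
class NodeContext {
  private volatile Thread thread;
  void setThread(Thread t) { this.thread = t; }
  Thread getThread() { return thread; }
}

class OioGroup {
  private final List<NodeContext> nodes;
  OioGroup(List<NodeContext> nodes) { this.nodes = nodes; }

  /** Called during activate(): every OiO node runs on the same shared thread. */
  void activate(Thread sharedThread) {
    for (NodeContext node : nodes) {
      // Previously only the first node of the group had its thread set, so the
      // other nodes were skipped by the check quoted above and never got
      // committed() callbacks.
      node.setThread(sharedThread);
    }
  }

  /** Mirrors the heartbeat check: now true for every node in the group. */
  boolean isAlive(NodeContext node) {
    Thread t = node.getThread();
    return t != null && t.isAlive();
  }
}
{code}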





[jira] [Commented] (APEXCORE-596) Committed method on operators not called when stream locality is THREAD_LOCAL

2017-01-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825575#comment-15825575
 ] 

ASF GitHub Bot commented on APEXCORE-596:
-

GitHub user francisf reopened a pull request:

https://github.com/apache/apex-core/pull/442

APEXCORE-596 Setting the thread for all oio nodes in the oio group

@tushargosavi please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/francisf/apex-core APEXCORE-596_OioThreadSet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #442


commit 3debbb998cfbb1866df32fe77e959e28412916a0
Author: francisf 
Date:   2017-01-05T07:32:05Z

APEXCORE-596 Setting the thread for all oio nodes in the oio group, 
refactoring tests




> Committed method on operators not called when stream locality is THREAD_LOCAL
> -
>
> Key: APEXCORE-596
> URL: https://issues.apache.org/jira/browse/APEXCORE-596
> Project: Apache Apex Core
>  Issue Type: Bug
>Affects Versions: 3.5.0
>Reporter: Francis Fernandes
>Assignee: Francis Fernandes
>
> When the locality of the stream connecting two operators is 
> Locality.THREAD_LOCAL, the committed method is not called for some operators 
> that implement Operator.CheckpointListener, e.g. AbstractFileOutputOperator.
> For THREAD_LOCAL streams, we do not set the thread in the node's context 
> during activate.
> Because the thread is not set, the operator is skipped in 
> processHeartBeatResponse and committed is never called:
> {code}
> if (thread == null || !thread.isAlive()) {
>   continue;
> }
> {code}
> We need this condition for invalid operators (operator failures) in case of 
> other localities. 





[GitHub] apex-core pull request #442: APEXCORE-596 Setting the thread for all oio nod...

2017-01-16 Thread francisf
GitHub user francisf reopened a pull request:

https://github.com/apache/apex-core/pull/442

APEXCORE-596 Setting the thread for all oio nodes in the oio group

@tushargosavi please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/francisf/apex-core APEXCORE-596_OioThreadSet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #442


commit 3debbb998cfbb1866df32fe77e959e28412916a0
Author: francisf 
Date:   2017-01-05T07:32:05Z

APEXCORE-596 Setting the thread for all oio nodes in the oio group, 
refactoring tests






[GitHub] apex-core pull request #442: APEXCORE-596 Setting the thread for all oio nod...

2017-01-16 Thread francisf
Github user francisf closed the pull request at:

https://github.com/apache/apex-core/pull/442




Re: Contribution Process before PR

2017-01-16 Thread Chinmay Kolhatkar
I agree that the discussions are important. Discussions are currently
happening over the mailing list, but some topics might get missed and
we should improve there. Tasks should be brought to the community before
the design starts, and the design should then mature through community
discussion. That said, there are a few observations I would like to point out:

1. If I understand correctly, any development happening should still be
bound by time. This includes the discussions over the mailing list or on the
PR as well. If design discussions are happening on the PR, then it means
the mailing list communication is failing somewhere.

2. Moreover, discussions should be conclusive. Some discussions keep
happening over and over and take a long time to reach a conclusion. IMO
this blocks development and does not benefit anyone.

3. There are mailing threads (few, but important ones) which run to as many
as 30+ email exchanges. After a point, I believe the crux of the topic
gets lost. If an email thread goes beyond 10 mails without reaching a
conclusion, then something is wrong with the way the discussion is heading,
and it is the responsibility of everyone, not just the original author, to
"hold the thought" and re-align the discussion to reach a conclusion and
consensus.

4. At the same time, there are cases where PR comments take the shape of a
design discussion because the original thread was not followed up
completely. Everyone has plenty to catch up on, so one might miss a
certain mail thread but still raise some crucial points in a PR comment.
In such cases, an alternate way (maybe an offline call for the author to
clarify the idea) might help. This is just to make sure that progress is
made on the current task, and it should not become standard practice.

5. We as a community should emphasize that quality code is produced, but we
should also be mindful of the time it takes. It is hard to get the balance
right, so it may be a good practice to break things down into phases
instead of trying to do everything in a single go. In that case, it should
be made clear to the community what the first phase is, the second phase,
and so on.


On Tue, Jan 17, 2017 at 12:04 PM, Bhupesh Chawda 
wrote:

> I agree.
> The discussion on requirements very much helps folks to understand the need
> for a feature and sometimes much more about related topics. From my own
> experience, the discussion thread on requirements before the implementation
> is what helps the design discussions and allows even people not aware of
> the internals, to quickly pickup things with little effort.
>
> At the same time, I do feel the need for a prototype implementation to
> clarify things, most of the times. But, I think this should be
> supplementary to the discussions and not a replacement.
>
> ~ Bhupesh
>
> On Tue, Jan 17, 2017 at 7:50 AM, Sandesh Hegde 
> wrote:
>
> > I do see value in having the review PR along with the discussion in
> > the mailing list/jira.
> >
> > It shows the commitment of the Author to the idea and to the project.
> > There is some truth in what Linus Torvalds says
> > https://lkml.org/lkml/2000/8/25/132
> >
> > Thanks
> >
> >
> > On Mon, Jan 16, 2017 at 8:11 AM Amol Kekre  wrote:
> >
> > > I do see folks discussing issues for most part now. For me this does
> not
> > > look like an issue that is hurting Apex community. I do however want to
> > > discuss cases where code gets blocked and get spilled into larger
> > context.
> > > As a culture and a process, we have a very high overhead that deters
> > > contributions.
> > >
> > > Thks
> > > Amol
> > >
> > >
> > > On Sun, Jan 15, 2017 at 8:50 PM, Pramod Immaneni <
> pra...@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Yes, it will be good to have these points added to the contributor
> > > > guidelines but I also see for the most part folks do bring up issues
> > for
> > > > discussion, try to address concerns and come to a consensus and in
> > > > generally participate in the community. I also think we should have
> > some
> > > > latitude in the process when it comes to bug fixes that are contained
> > and
> > > > don't spill into re-design of components otherwise, the overhead will
> > > deter
> > > > contributions especially from folks who are new to the project and
> want
> > > to
> > > > start contributing by fixing low hanging bugs.
> > > >
> > > > Thanks
> > > >
> > > > On Sun, Jan 15, 2017 at 7:50 PM, Thomas Weise 
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I want to propose additions to the contributor guidelines that
> place
> > > > > stronger emphasis on open collaboration and the early part of the
> > > > > contribution process.
> > > > >
> > > > > Specifically, I would like to suggest that *thought process* and
> > > > > *design discussion* are more important than the final code produced.

Re: Schema Discovery Support in Apex Applications

2017-01-16 Thread Bhupesh Chawda
+1 for the feature.

~ Bhupesh

On Mon, Jan 16, 2017 at 5:09 PM, Chinmay Kolhatkar 
wrote:

> Those are not really anonymous POJOs... The definition of POJO will be
> known to user as based on that only upstream operator will convey the tuple
> type the operator will be emitting.
> Using that information user can configure the operators. Those properties
> will be a bit different though.
>
> On Mon, Jan 16, 2017 at 4:20 PM, AJAY GUPTA  wrote:
>
> > +1 for the idea.
> >
> > I just had one question.
> >
> > As I understand, there will be some form of Anonymous POJO used as
> objects
> > to pass information from one operator to another. Can you share how the
> > user/operator developer would access the tuple object in case he wishes
> to
> > do something with it?
> >
> >
> > Ajay
> >
> > On Mon, Jan 16, 2017 at 2:53 PM, Chinmay Kolhatkar 
> > wrote:
> >
> > > Hi All,
> > >
> > > Currently a DAG that is generated by user, if contains any POJOfied
> > > operators, TUPLE_CLASS attribute needs to be set on each and every port
> > > which receives or sends a POJO.
> > >
> > > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup ->
> > > Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set by user
> on
> > > both input and output ports of transform, dedup operators and also on
> > > parser output and formatter input.
> > >
> > > The proposal here is to reduce work that is required by user to
> configure
> > > the DAG. Technically speaking if an operators knows input schema and
> > > processing properties, it can determine output schema and convey it to
> > > downstream operators. This way the complete pipeline can be configured
> > > without user setting TUPLE_CLASS or even creating POJOs and adding them
> > to
> > > classpath.
> > >
> > > On the same idea, I want to propose an approach where the pipeline can
> be
> > > configured without user setting TUPLE_CLASS or even creating POJOs and
> > > adding them to classpath.
> > > Here is the document which at a high level explains the idea and a high
> > > level design:
> > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_
> > > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
> > >
> > > I would like to get opinion from community about feasibility and
> > > applications of this proposal.
> > > Once we get some consensus we can discuss the design in details.
> > >
> > > Thanks,
> > > Chinmay.
> > >
> >
>


Re: Contribution Process before PR

2017-01-16 Thread Bhupesh Chawda
I agree.
The discussion on requirements very much helps folks understand the need
for a feature, and sometimes much more about related topics. From my own
experience, the discussion thread on requirements before the implementation
is what drives the design discussions and allows even people not aware of
the internals to quickly pick up things with little effort.

At the same time, I do feel the need for a prototype implementation to
clarify things most of the time. But I think this should be
supplementary to the discussions and not a replacement.

~ Bhupesh

On Tue, Jan 17, 2017 at 7:50 AM, Sandesh Hegde 
wrote:

> I do see value in having the review PR along with the discussion in
> the mailing list/jira.
>
> It shows the commitment of the Author to the idea and to the project.
> There is some truth in what Linus Torvalds says
> https://lkml.org/lkml/2000/8/25/132
>
> Thanks
>
>
> On Mon, Jan 16, 2017 at 8:11 AM Amol Kekre  wrote:
>
> > I do see folks discussing issues for most part now. For me this does not
> > look like an issue that is hurting Apex community. I do however want to
> > discuss cases where code gets blocked and get spilled into larger
> context.
> > As a culture and a process, we have a very high overhead that deters
> > contributions.
> >
> > Thks
> > Amol
> >
> >
> > On Sun, Jan 15, 2017 at 8:50 PM, Pramod Immaneni  >
> > wrote:
> >
> > > Yes, it will be good to have these points added to the contributor
> > > guidelines but I also see for the most part folks do bring up issues
> for
> > > discussion, try to address concerns and come to a consensus and in
> > > generally participate in the community. I also think we should have
> some
> > > latitude in the process when it comes to bug fixes that are contained
> and
> > > don't spill into re-design of components otherwise, the overhead will
> > deter
> > > contributions especially from folks who are new to the project and want
> > to
> > > start contributing by fixing low hanging bugs.
> > >
> > > Thanks
> > >
> > > On Sun, Jan 15, 2017 at 7:50 PM, Thomas Weise  wrote:
> > >
> > > > Hi,
> > > >
> > > > I want to propose additions to the contributor guidelines that place
> > > > stronger emphasis on open collaboration and the early part of the
> > > > contribution process.
> > > >
> > > > Specifically, I would like to suggest that *thought process* and
> > *design
> > > > discussion* are more important than the final code produced. It is
> > > > necessary to develop the community and invest in the future of the
> > > project.
> > > >
> > > > I start this discussion based on observation over time. I have seen
> > cases
> > > > (non trivial changes) where code and JIRAs appear at the same time,
> > where
> > > > the big picture is discussed after the PR is already open, or where
> > > > information that would be valuable to other contributors or users
> isn't
> > > on
> > > > record.
> > > >
> > > > Let's consider a non-trivial change or a feature. It would normally
> > start
> > > > with engagement on the mailing list to ensure time is well spent and
> > the
> > > > proposal is welcomed by the community, does not conflict with other
> > > > initiatives etc.
> > > >
> > > > Once that is cleared, we would want to think about design, the how in
> > the
> > > > larger picture. In many cases that would involve discussion,
> questions,
> > > > suggestions, consensus building towards agreed approach. Or maybe it
> is
> > > > done through prototyping. In any case, before a PR is raised, it will
> > be
> > > > good to have as prerequisite that *thought process and approach have
> > been
> > > > documented*. I would prefer to see that on the JIRA, what do others
> > > think?
> > > >
> > > > Benefits:
> > > >
> > > > * Contributor does not waste time and there is no frustration due to
> a
> > PR
> > > > being turned down for reasons that could be avoided with upfront
> > > > communication.
> > > >
> > > > * Contributor benefits from suggestions, questions, guidance of those
> > > with
> > > > in depth knowledge of particular areas.
> > > >
> > > > * Other community members have an opportunity to learn from
> discussion,
> > > the
> > > > knowledge base broadens.
> > > >
> > > > * Information gets indexed, user later looking at JIRAs will find
> > > valuable
> > > > information on how certain problems were solved that they would never
> > > > obtain from a PR.
> > > >
> > > > The ASF and "Apache Way", a read for the bigger picture with more
> links
> > > in
> > > > it:
> > > > http://krzysztof-sobkowiak.net/blog/celebrating-17-years-
> > > > of-the-apache-software-foundation/
> > > >
> > > > Looking forward to feedback and discussion,
> > > > Thomas
> > > >
> > >
> >
>


Re: Extending DAG API for accessing objects in DAG.

2017-01-16 Thread Tushar Gosavi
I have opened https://issues.apache.org/jira/browse/APEXCORE-604 for this.
If there is no objection, I will open a pull request for it.

- Tushar.


On Mon, Jan 9, 2017 at 4:44 PM, Tushar Gosavi  wrote:
> Hi All,
>
> I am planning to extend the DAG API to export internals of the DAG.
> The current DAG does not provide a way to get the list of operators and
> streams with their attributes. Also, StreamMeta does not provide an
> API to access its end ports.
>
> This type of information is needed when an external translator (like
> Samoa) is constructing the DAG,
> and we need to apply some transformation or configure the DAG before
> it is run. I am planning to extend the
> DAG API with the following:
>
> InputPortMeta
> public Operator.InputPort getPortObject();
> OutputPortMeta
> public Operator.OutputPort getPortObject();
> streamMeta
> public  T getSource();
> public  Collection getSinks();
> DAG
> public abstract  Collection getOperators();
> public abstract  Collection getStreams();
>
> Regards,
> - Tushar.
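
For illustration only, here is a hypothetical sketch of how an external translator might walk a DAG through accessors along the proposed lines; the interfaces and generic parameters below are assumptions (the mail formatting stripped the original angle brackets), not the final APEXCORE-604 API.

{code}
import java.util.Collection;

// Stand-in interfaces; the real meta classes live in apex-core.
interface OperatorMeta { String getName(); }
interface StreamMeta {
  OperatorMeta getSourceOperator();
  Collection<OperatorMeta> getSinkOperators();
}
interface DagView {
  Collection<OperatorMeta> getOperators();
  Collection<StreamMeta> getStreams();
}

class DagInspector {
  /** Print the topology so a translator can decide what to rewire or configure. */
  static void dump(DagView dag) {
    for (OperatorMeta op : dag.getOperators()) {
      System.out.println("operator: " + op.getName());
    }
    for (StreamMeta s : dag.getStreams()) {
      System.out.println("stream: " + s.getSourceOperator().getName()
          + " -> " + s.getSinkOperators().size() + " sink(s)");
    }
  }
}
{code}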


Re: Sharing jars among different Apex apps on cluster

2017-01-16 Thread Bhupesh Chawda
Yes Amol, the PR that Thomas pointed out is for that purpose.
It allows changing the plan at runtime. We can remove all operators
entirely and add new operators to the existing DAG.

~ Bhupesh

On Jan 16, 2017 23:38, "Amol Kekre"  wrote:

> Bhupesh,
> If stram changes the logical dag (runs only the sub-dag) on each runs, it
> should be ok.
>
> Thks
> Amol
>
>
> On Mon, Jan 16, 2017 at 9:18 AM, Bhupesh Chawda 
> wrote:
>
> > Yes, I thought of that. That way it will be a single application
> > throughout.
> > Just wanted to see if there can be other options.
> >
> > ~ Bhupesh
> >
> > On Mon, Jan 16, 2017 at 9:17 PM, Thomas Weise  wrote:
> >
> > > Sounds like a fit for https://github.com/apache/apex-core/pull/410 ?
> > >
> > > On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda <
> bhup...@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > We have a use case where I need to launch a number of DAGs on the
> > cluster
> > > > one after the other in sequence programatically.
> > > >
> > > > We are using the StramAppLauncher and StramAppFactory classes to
> > launch a
> > > > DAG programatically on the cluster and adding any third party
> > > dependencies
> > > > as part of the configuration.
> > > >
> > > > It is working fine except for the following issue:
> > > > Every time a DAG is launched, it copies the dependencies to the
> > > application
> > > > folder and hence spends a good amount of time before the app actually
> > > > starts running. All of the apps I run belong to the same project and
> > > hence
> > > > don't actually need separate set of jars.
> > > >
> > > > Is there any way I can make all the applications "share" the jars
> which
> > > are
> > > > uploaded when the first application is run?
> > > >
> > > > Thanks.
> > > >
> > > > ~ Bhupesh
> > > >
> > >
> >
>


Re: Suggestion on optimise kryo Output

2017-01-16 Thread Sandesh Hegde
Kryo is used in a default implementation of the StreamCodec interface.
Ideally, if the StreamCodec interface itself allowed a buffer to be passed in,
then we could also send the buffer from the BufferServer in the future.

On Mon, Jan 9, 2017 at 4:10 PM Bright Chen  wrote:

> Hi,
>
> The Kryo Output has some limitations:
>
> - The size of the data is limited: Kryo writes data to a fixed buffer and
>   will throw an overflow exception if the data exceeds the buffer size.
> - Output.toBytes() copies the data to a temporary buffer before output,
>   which decreases performance and introduces garbage collection.
>
> While tuning the Spillable data structures and Managed State, I created a
> mechanism to share and reuse memory to avoid the above problems, and it can
> be reused in core serialization with a small change. Please see the JIRA:
> https://issues.apache.org/jira/browse/APEXMALHAR-2190
>
>
> Any suggestion or comments, please put in jira:
> https://issues.apache.org/jira/browse/APEXCORE-606
>
>
> thanks
>
> Bright
>
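
For readers unfamiliar with the limitations described above, here is a small self-contained illustration using Kryo's public Output API directly (not Apex code): a fixed-size Output overflows, while a growable Output avoids the overflow but toBytes() still makes an extra copy.

{code}
import com.esotericsoftware.kryo.KryoException;
import com.esotericsoftware.kryo.io.Output;

public class KryoOutputDemo {
  public static void main(String[] args) {
    byte[] payload = new byte[64 * 1024];

    // 1. Fixed buffer: throws once the payload exceeds the buffer size.
    try {
      Output fixed = new Output(new byte[4 * 1024]);
      fixed.writeBytes(payload);
    } catch (KryoException e) {
      System.out.println("fixed buffer: " + e.getMessage());
    }

    // 2. Growable buffer (maxBufferSize = -1 means unbounded): no overflow,
    //    but toBytes() still copies the written data into a fresh array,
    //    which is the extra copy / garbage the proposal wants to avoid.
    Output growable = new Output(4 * 1024, -1);
    growable.writeBytes(payload);
    byte[] copy = growable.toBytes();
    System.out.println("serialized " + copy.length + " bytes with one extra copy");
  }
}
{code}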


[jira] [Commented] (APEXCORE-570) Prevent upstream operators from getting too far ahead when downstream operators are slow

2017-01-16 Thread Pramod Immaneni (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824628#comment-15824628
 ] 

Pramod Immaneni commented on APEXCORE-570:
--

I see your point that you could disable back pressure on the bursty one and get 
a better throughput.

> Prevent upstream operators from getting too far ahead when downstream 
> operators are slow
> 
>
> Key: APEXCORE-570
> URL: https://issues.apache.org/jira/browse/APEXCORE-570
> Project: Apache Apex Core
>  Issue Type: Improvement
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>
> If the downstream operators are slower than upstream operators then the 
> upstream operators will get ahead and the gap can continue to increase. 
> Provide an option to slow down or temporarily pause the upstream operators 
> when they get too far ahead.





[jira] [Commented] (APEXCORE-570) Prevent upstream operators from getting too far ahead when downstream operators are slow

2017-01-16 Thread Pramod Immaneni (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824455#comment-15824455
 ] 

Pramod Immaneni commented on APEXCORE-570:
--

"Consider the case of two subscribers with same throughput but different 
pattern. Once with constant and capped rate, other with bursts of activity. If 
they are forced to read at the same rate, then both will be slower and overall 
throughput reduce? Granted this is not the common scenario, but something to 
consider."

Yes, it would result in slower throughput, but since the operator is controlled 
by a single thread, I think back pressure over one stream will block the other.
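
A generic illustration of that point (plain Java, not Apex code): when a single thread services two input streams, blocking on one stream also stalls the other.

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleThreadCoupling {
  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<String> streamA = new ArrayBlockingQueue<>(10);
    BlockingQueue<String> streamB = new ArrayBlockingQueue<>(10);
    streamB.put("tuple-from-B"); // B has a tuple ready immediately

    Thread operator = new Thread(() -> {
      try {
        // The single operator thread blocks on A (think: A is being
        // throttled by back pressure, or simply has nothing to deliver) ...
        String a = streamA.take();
        // ... so B's tuple is only processed after A unblocks.
        String b = streamB.poll();
        System.out.println("processed " + a + " then " + b);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    operator.start();

    Thread.sleep(1000);          // stream A stays silent for a second
    streamA.put("tuple-from-A"); // only now can the operator make progress
    operator.join();
  }
}
{code}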

> Prevent upstream operators from getting too far ahead when downstream 
> operators are slow
> 
>
> Key: APEXCORE-570
> URL: https://issues.apache.org/jira/browse/APEXCORE-570
> Project: Apache Apex Core
>  Issue Type: Improvement
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>
> If the downstream operators are slower than upstream operators then the 
> upstream operators will get ahead and the gap can continue to increase. 
> Provide an option to slow down or temporarily pause the upstream operators 
> when they get too far ahead.





Re: Sharing jars among different Apex apps on cluster

2017-01-16 Thread Amol Kekre
Bhupesh,
If stram changes the logical DAG (runs only the sub-DAG) on each run, it
should be ok.

Thks
Amol


On Mon, Jan 16, 2017 at 9:18 AM, Bhupesh Chawda 
wrote:

> Yes, I thought of that. That way it will be a single application
> throughout.
> Just wanted to see if there can be other options.
>
> ~ Bhupesh
>
> On Mon, Jan 16, 2017 at 9:17 PM, Thomas Weise  wrote:
>
> > Sounds like a fit for https://github.com/apache/apex-core/pull/410 ?
> >
> > On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda  >
> > wrote:
> >
> > > Hi All,
> > >
> > > We have a use case where I need to launch a number of DAGs on the
> cluster
> > > one after the other in sequence programatically.
> > >
> > > We are using the StramAppLauncher and StramAppFactory classes to
> launch a
> > > DAG programatically on the cluster and adding any third party
> > dependencies
> > > as part of the configuration.
> > >
> > > It is working fine except for the following issue:
> > > Every time a DAG is launched, it copies the dependencies to the
> > application
> > > folder and hence spends a good amount of time before the app actually
> > > starts running. All of the apps I run belong to the same project and
> > hence
> > > don't actually need separate set of jars.
> > >
> > > Is there any way I can make all the applications "share" the jars which
> > are
> > > uploaded when the first application is run?
> > >
> > > Thanks.
> > >
> > > ~ Bhupesh
> > >
> >
>


Re: Sharing jars among different Apex apps on cluster

2017-01-16 Thread Bhupesh Chawda
Yes, I thought of that. That way it will be a single application throughout.
Just wanted to see if there can be other options.

~ Bhupesh

On Mon, Jan 16, 2017 at 9:17 PM, Thomas Weise  wrote:

> Sounds like a fit for https://github.com/apache/apex-core/pull/410 ?
>
> On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda 
> wrote:
>
> > Hi All,
> >
> > We have a use case where I need to launch a number of DAGs on the cluster
> > one after the other in sequence programatically.
> >
> > We are using the StramAppLauncher and StramAppFactory classes to launch a
> > DAG programatically on the cluster and adding any third party
> dependencies
> > as part of the configuration.
> >
> > It is working fine except for the following issue:
> > Every time a DAG is launched, it copies the dependencies to the
> application
> > folder and hence spends a good amount of time before the app actually
> > starts running. All of the apps I run belong to the same project and
> hence
> > don't actually need separate set of jars.
> >
> > Is there any way I can make all the applications "share" the jars which
> are
> > uploaded when the first application is run?
> >
> > Thanks.
> >
> > ~ Bhupesh
> >
>


Re: Contribution Process before PR

2017-01-16 Thread Amol Kekre
I do see folks discussing issues for the most part now. To me this does not
look like an issue that is hurting the Apex community. I do, however, want to
discuss cases where code gets blocked and spills into a larger context.
As a culture and a process, we have a very high overhead that deters
contributions.

Thks
Amol


On Sun, Jan 15, 2017 at 8:50 PM, Pramod Immaneni 
wrote:

> Yes, it will be good to have these points added to the contributor
> guidelines but I also see for the most part folks do bring up issues for
> discussion, try to address concerns, come to a consensus, and generally
> participate in the community. I also think we should have some
> latitude in the process when it comes to bug fixes that are contained and
> don't spill into re-design of components otherwise, the overhead will deter
> contributions especially from folks who are new to the project and want to
> start contributing by fixing low hanging bugs.
>
> Thanks
>
> On Sun, Jan 15, 2017 at 7:50 PM, Thomas Weise  wrote:
>
> > Hi,
> >
> > I want to propose additions to the contributor guidelines that place
> > stronger emphasis on open collaboration and the early part of the
> > contribution process.
> >
> > Specifically, I would like to suggest that *thought process* and *design
> > discussion* are more important than the final code produced. It is
> > necessary to develop the community and invest in the future of the
> project.
> >
> > I start this discussion based on observation over time. I have seen cases
> > (non trivial changes) where code and JIRAs appear at the same time, where
> > the big picture is discussed after the PR is already open, or where
> > information that would be valuable to other contributors or users isn't
> on
> > record.
> >
> > Let's consider a non-trivial change or a feature. It would normally start
> > with engagement on the mailing list to ensure time is well spent and the
> > proposal is welcomed by the community, does not conflict with other
> > initiatives etc.
> >
> > Once that is cleared, we would want to think about design, the how in the
> > larger picture. In many cases that would involve discussion, questions,
> > suggestions, consensus building towards agreed approach. Or maybe it is
> > done through prototyping. In any case, before a PR is raised, it will be
> > good to have as prerequisite that *thought process and approach have been
> > documented*. I would prefer to see that on the JIRA, what do others
> think?
> >
> > Benefits:
> >
> > * Contributor does not waste time and there is no frustration due to a PR
> > being turned down for reasons that could be avoided with upfront
> > communication.
> >
> > * Contributor benefits from suggestions, questions, guidance of those
> with
> > in depth knowledge of particular areas.
> >
> > * Other community members have an opportunity to learn from discussion,
> the
> > knowledge base broadens.
> >
> > * Information gets indexed, user later looking at JIRAs will find
> valuable
> > information on how certain problems were solved that they would never
> > obtain from a PR.
> >
> > The ASF and "Apache Way", a read for the bigger picture with more links
> in
> > it:
> > http://krzysztof-sobkowiak.net/blog/celebrating-17-years-
> > of-the-apache-software-foundation/
> >
> > Looking forward to feedback and discussion,
> > Thomas
> >
>


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

2017-01-16 Thread Thomas Weise
Bhupesh,

Please see how that can be solved in a unified way using windows and
watermarks. It is bounded data vs. unbounded data. In Beam for example, you
can use the "global window" and the final watermark to accomplish what you
are looking for. Batch is just a special case of streaming where the source
emits the final watermark.

Thanks,
Thomas
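
As a reference point, here is a minimal Apache Beam sketch (Beam Java SDK; file paths are placeholders) of the "batch as bounded streaming" idea: a bounded source in the global window, with the aggregate emitted once the final watermark arrives.

{code}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;

public class BoundedAsBatch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A bounded source: the reader emits the final watermark when the file ends.
    PCollection<String> lines = p.apply(TextIO.read().from("/tmp/input.txt"));

    // Bounded data lives in the global window; the aggregate below is emitted
    // exactly once, when the final watermark arrives -- i.e. classic batch.
    PCollection<Long> total = lines
        .apply(Window.into(new GlobalWindows()))
        .apply(Count.globally());

    p.run().waitUntilFinish();
  }
}
{code}

The same pipeline code works on unbounded input; only the source and watermark behavior change, which is the unification being pointed out.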


On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda 
wrote:

> Yes, if the user needs to develop a batch application, then batch aware
> operators need to be used in the application.
> The nature of the application is mostly controlled by the input and the
> output operators used in the application.
>
> For example, consider an application which needs to filter records in a
> input file and store the filtered records in another file. The nature of
> this app is to end once the entire file is processed. Following things are
> expected of the application:
>
>1. Once the input data is over, finalize the output file from .tmp
>files. - Responsibility of output operator
>2. End the application, once the data is read and processed -
>Responsibility of input operator
>
> These functions are essential to allow the user to do higher level
> operations like scheduling or running a workflow of batch applications.
>
> I am not sure about intermediate (processing) operators, as there is no
> change in their functionality for batch use cases. Perhaps, allowing
> running multiple batches in a single application may require similar
> changes in processing operators as well.
>
> ~ Bhupesh
>
> On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale 
> wrote:
>
> > Will it make an impression on user that, if he has a batch usecase he has
> > to use batch aware operators only? If so, is that what we expect? I am
> not
> > aware of how do we implement batch scenario so this might be a basic
> > question.
> >
> > -Priyanka
> >
> > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <
> bhup...@datatorrent.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > While design / implementation for custom control tuples is ongoing, I
> > > thought it would be a good idea to consider its usefulness in one of
> the
> > > use cases -  batch applications.
> > >
> > > This is a proposal to adapt / extend existing operators in the Apache
> > Apex
> > > Malhar library so that it is easy to use them in batch use cases.
> > > Naturally, this would be applicable for only a subset of operators like
> > > File, JDBC and NoSQL databases.
> > > For example, for a file based store, (say HDFS store), we could have
> > > FileBatchInput and FileBatchOutput operators which allow easy
> integration
> > > into a batch application. These operators would be extended from their
> > > existing implementations and would be "Batch Aware", in that they may
> > > understand the meaning of some specific control tuples that flow
> through
> > > the DAG. Start batch and end batch seem to be the obvious candidates
> that
> > > come to mind. On receipt of such control tuples, they may try to modify
> > the
> > > behavior of the operator - to reinitialize some metrics or finalize an
> > > output file for example.
> > >
> > > We can discuss the potential control tuples and actions in detail, but
> > > first I would like to understand the views of the community for this
> > > proposal.
> > >
> > > ~ Bhupesh
> > >
> >
>


Re: Sharing jars among different Apex apps on cluster

2017-01-16 Thread Thomas Weise
Sounds like a fit for https://github.com/apache/apex-core/pull/410 ?

On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda 
wrote:

> Hi All,
>
> We have a use case where I need to launch a number of DAGs on the cluster
> one after the other in sequence programatically.
>
> We are using the StramAppLauncher and StramAppFactory classes to launch a
> DAG programatically on the cluster and adding any third party dependencies
> as part of the configuration.
>
> It is working fine except for the following issue:
> Every time a DAG is launched, it copies the dependencies to the application
> folder and hence spends a good amount of time before the app actually
> starts running. All of the apps I run belong to the same project and hence
> don't actually need separate set of jars.
>
> Is there any way I can make all the applications "share" the jars which are
> uploaded when the first application is run?
>
> Thanks.
>
> ~ Bhupesh
>


Re: APEXMALHAR-2382 User needs to create dt_meta table while using JdbcPOJOInsertOutputOperator

2017-01-16 Thread Thomas Weise
For JDBC exactly-once results, the window metadata needs to be committed
along with the user data, hence the need for an additional schema in the
target database. There have been webinars and blogs about processing
guarantees; please have a look at those.

Table creation cannot be automatic by default. As I already said on the
JIRA, it could be an option for testing or development; it just needs to
be off by default. I would also like to see the table name changed.

If the library can generate the DDL for the table for the target database,
then when the table is missing it could output that DDL along with
instructions in the log to make things easier for users.

It might also be a good idea to document the setup steps for multiple
operators to write to the same schema.

Thanks
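
To make the first point concrete, here is a hedged sketch in plain JDBC of why the meta table enables exactly-once; the table and column names are illustrative and do not reflect the actual dt_meta DDL or the Malhar operator code.

{code}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

class ExactlyOnceJdbcWriter {
  private final Connection conn; // opened with autoCommit=false; a row for
                                 // (app_id, operator_id) is seeded at setup
  ExactlyOnceJdbcWriter(Connection conn) { this.conn = conn; }

  long committedWindow(String appId, int operatorId) throws Exception {
    try (PreparedStatement ps = conn.prepareStatement(
        "SELECT win_id FROM window_meta WHERE app_id = ? AND operator_id = ?")) {
      ps.setString(1, appId);
      ps.setInt(2, operatorId);
      try (ResultSet rs = ps.executeQuery()) {
        return rs.next() ? rs.getLong(1) : -1L;
      }
    }
  }

  /** Write one window's tuples and record the window id in one transaction. */
  void endWindow(String appId, int operatorId, long windowId, List<String> tuples) throws Exception {
    if (windowId <= committedWindow(appId, operatorId)) {
      return; // window replayed after recovery: rows are already in the database
    }
    try (PreparedStatement insert = conn.prepareStatement(
             "INSERT INTO target_table(payload) VALUES (?)");
         PreparedStatement meta = conn.prepareStatement(
             "UPDATE window_meta SET win_id = ? WHERE app_id = ? AND operator_id = ?")) {
      for (String t : tuples) {
        insert.setString(1, t);
        insert.addBatch();
      }
      insert.executeBatch();
      meta.setLong(1, windowId);
      meta.setString(2, appId);
      meta.setInt(3, operatorId);
      meta.executeUpdate();
      conn.commit(); // user rows and window metadata become visible together
    } catch (Exception e) {
      conn.rollback();
      throw e;
    }
  }
}
{code}

Because the window id and the user rows commit together, a replayed window after failure is detected and skipped, which is why the extra table has to live in the same target database.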

On Sun, Jan 15, 2017 at 11:55 PM, Devendra Tagare  wrote:

> Hi,
>
> -1 on auto table creation. Reasons have been eloquently elaborated in the
> earlier posts.
>
> +1 on revisiting the approach taken for exactly once in the
> JDBCPollInputOperator.
>
> One way could be to move the JDBC read and write operators into a separate
> module (like apex-malhar/kafka) and add statistics, metrics, and meta-data
> features along similar lines. This module can have the required concrete
> implementations for mysql, psql, etc.
>
> Thanks,
> Dev
>
> On Sun, Jan 15, 2017 at 10:34 PM, Chinmay Kolhatkar 
> wrote:
>
> > -1 for automatic schema creation...
> >
> > Moreover, I am wondering whether asking the user to create a dt_meta table is
> > the right way. From an admin's perspective, a request to create a meta table
> > looks wrong to me. The dt_meta table is created for the purpose of exactly
> > once, but it does not hold any user data. On this logic an admin might deny
> > developers the creation of the table.
> >
> > I suggest starting a separate thread to do exactly once for JDBC insert in
> > a cleaner way. We can take a look at Kafka or File outputs to see how
> > they've achieved exactly once without creating a meta location at the
> > destination.
> >
> > -Chinmay.
> >
> >
> > On Mon, Jan 16, 2017 at 11:16 AM, Pradeep Kumbhar <
> prad...@datatorrent.com
> > >
> > wrote:
> >
> > > +1 on having operator documentation explicitly mentioning that,
> "dt_meta"
> > > table is mandatory
> > > for the operator to work correctly. Also provide a sample table
> creation
> > > query for reference.
> > >
> > > On Sat, Jan 14, 2017 at 1:05 PM, AJAY GUPTA 
> > wrote:
> > >
> > > > Since the query can be different for different databases, the user
> will
> > > > have to provide query to the operator. Rather than this, I believe
> it's
> > > > easier for user to directly execute create table query on DB.
> > > >
> > > > Also, the create table script won't be that heavy that we create
> script
> > > for
> > > > it. Probably adding a generic type of query in the docs itself should
> > > > suffice.
> > > >
> > > >
> > > > Ajay
> > > >
> > > > On Sat, 14 Jan 2017 at 10:27 AM, Yogi Devendra <yogideven...@apache.org>
> > > > wrote:
> > > >
> > > > > As Aniruddha pointed out, table creation should be done by dbadmin.
> > > > > In that case, utility script will be helpful.
> > > > > If we embed this code inside operator or application; then it will be
> > > > > difficult for dbadmin to use it.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > On 14 January 2017 at 03:43, Thomas Weise  wrote:
> > > > >
> > > > > > -1 for automatic schema modification, unless the user asked for it.
> > > > > > See comment on JIRA.
> > > > > >
> > > > > > On Fri, Jan 13, 2017 at 5:11 AM, Aniruddha Thombare <
> > > > > > anirud...@datatorrent.com> wrote:
> > > > > >
> > > > > > > The tables should be created / altered by dbadmin.
> > > > > > > We shouldn't worry about table creations as its one-time activity.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > A
> > > > > > >
> > > > > > > _
> > > > > > > Sent with difficulty, I mean handheld ;)
> > > > > > >
> > > > > > > On 13 Jan 2017 6:37 pm, "Yogi Devendra" <yogideven...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > I am not very keen on having utility script.
> > > > > > > But, "no side-effects without explicit ask by the end-user" is
> > > > > > > important.
> > > > > > >
> > > > > > > ~ Yogi
> > > > > > >
> > > > > > > On 13 January 2017 at 16:44, Priyanka Gugale <pri...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > IMO it's 

Re: Schema Discovery Support in Apex Applications

2017-01-16 Thread Chinmay Kolhatkar
Those are not really anonymous POJOs... The definition of the POJO will be
known to the user, because the upstream operator will convey the tuple
type it will be emitting.
Using that information, the user can configure the operators. Those properties
will be a bit different, though.

On Mon, Jan 16, 2017 at 4:20 PM, AJAY GUPTA  wrote:

> +1 for the idea.
>
> I just had one question.
>
> As I understand, there will be some form of Anonymous POJO used as objects
> to pass information from one operator to another. Can you share how the
> user/operator developer would access the tuple object in case he wishes to
> do something with it?
>
>
> Ajay
>
> On Mon, Jan 16, 2017 at 2:53 PM, Chinmay Kolhatkar 
> wrote:
>
> > Hi All,
> >
> > Currently a DAG that is generated by user, if contains any POJOfied
> > operators, TUPLE_CLASS attribute needs to be set on each and every port
> > which receives or sends a POJO.
> >
> > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup ->
> > Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set by user on
> > both input and output ports of transform, dedup operators and also on
> > parser output and formatter input.
> >
> > The proposal here is to reduce work that is required by user to configure
> > the DAG. Technically speaking if an operators knows input schema and
> > processing properties, it can determine output schema and convey it to
> > downstream operators. This way the complete pipeline can be configured
> > without user setting TUPLE_CLASS or even creating POJOs and adding them
> to
> > classpath.
> >
> > On the same idea, I want to propose an approach where the pipeline can be
> > configured without user setting TUPLE_CLASS or even creating POJOs and
> > adding them to classpath.
> > Here is the document which at a high level explains the idea and a high
> > level design:
> > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_
> > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
> >
> > I would like to get opinion from community about feasibility and
> > applications of this proposal.
> > Once we get some consensus we can discuss the design in details.
> >
> > Thanks,
> > Chinmay.
> >
>


Sharing jars among different Apex apps on cluster

2017-01-16 Thread Bhupesh Chawda
Hi All,

We have a use case where I need to launch a number of DAGs on the cluster
one after the other, in sequence, programmatically.

We are using the StramAppLauncher and StramAppFactory classes to launch a
DAG programmatically on the cluster, adding any third-party dependencies
as part of the configuration.

It is working fine except for the following issue:
Every time a DAG is launched, it copies the dependencies to the application
folder and hence spends a good amount of time before the app actually
starts running. All of the apps I run belong to the same project and hence
don't actually need a separate set of jars.

Is there any way I can make all the applications "share" the jars which are
uploaded when the first application is run?

Thanks.

~ Bhupesh


Re: Schema Discovery Support in Apex Applications

2017-01-16 Thread AJAY GUPTA
+1 for the idea.

I just had one question.

As I understand, there will be some form of Anonymous POJO used as objects
to pass information from one operator to another. Can you share how the
user/operator developer would access the tuple object in case he wishes to
do something with it?


Ajay

On Mon, Jan 16, 2017 at 2:53 PM, Chinmay Kolhatkar 
wrote:

> Hi All,
>
> Currently a DAG that is generated by user, if contains any POJOfied
> operators, TUPLE_CLASS attribute needs to be set on each and every port
> which receives or sends a POJO.
>
> For e.g., if a DAG is like File -> Parser -> Transform -> Dedup ->
> Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set by user on
> both input and output ports of transform, dedup operators and also on
> parser output and formatter input.
>
> The proposal here is to reduce work that is required by user to configure
> the DAG. Technically speaking if an operators knows input schema and
> processing properties, it can determine output schema and convey it to
> downstream operators. This way the complete pipeline can be configured
> without user setting TUPLE_CLASS or even creating POJOs and adding them to
> classpath.
>
> On the same idea, I want to propose an approach where the pipeline can be
> configured without user setting TUPLE_CLASS or even creating POJOs and
> adding them to classpath.
> Here is the document which at a high level explains the idea and a high
> level design:
> https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_
> tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
>
> I would like to get opinion from community about feasibility and
> applications of this proposal.
> Once we get some consensus we can discuss the design in details.
>
> Thanks,
> Chinmay.
>


[jira] [Resolved] (APEXMALHAR-2183) Add user document for CsvFormatter operator

2017-01-16 Thread shubham pathak (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shubham pathak resolved APEXMALHAR-2183.

Resolution: Fixed

> Add user document for CsvFormatter operator
> ---
>
> Key: APEXMALHAR-2183
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2183
> Project: Apache Apex Malhar
>  Issue Type: Task
>Reporter: Venkatesh Kottapalli
>Assignee: Venkatesh Kottapalli
>  Labels: newbie
> Fix For: 3.7.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Add user documentation for CsvFormatter operator





[jira] [Updated] (APEXMALHAR-2183) Add user document for CsvFormatter operator

2017-01-16 Thread shubham pathak (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shubham pathak updated APEXMALHAR-2183:
---
Fix Version/s: 3.7.0

> Add user document for CsvFormatter operator
> ---
>
> Key: APEXMALHAR-2183
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2183
> Project: Apache Apex Malhar
>  Issue Type: Task
>Reporter: Venkatesh Kottapalli
>Assignee: Venkatesh Kottapalli
>  Labels: newbie
> Fix For: 3.7.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Add user documentation for CsvFormatter operator





[jira] [Commented] (APEXMALHAR-2183) Add user document for CsvFormatter operator

2017-01-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823698#comment-15823698
 ] 

ASF GitHub Bot commented on APEXMALHAR-2183:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/400


> Add user document for CsvFormatter operator
> ---
>
> Key: APEXMALHAR-2183
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2183
> Project: Apache Apex Malhar
>  Issue Type: Task
>Reporter: Venkatesh Kottapalli
>Assignee: Venkatesh Kottapalli
>  Labels: newbie
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Add user documentation for CsvFormatter operator





[GitHub] apex-malhar pull request #400: APEXMALHAR-2183 - Documentation for CsvFormat...

2017-01-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/400




[jira] [Commented] (APEXMALHAR-2384) Add documentation for FixedWIdthParser

2017-01-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823673#comment-15823673
 ] 

ASF GitHub Bot commented on APEXMALHAR-2384:


GitHub user Hitesh-Scorpio opened a pull request:

https://github.com/apache/apex-malhar/pull/533

APEXMALHAR-2384 adding documentation for fixed width parser

@amberarrow please review.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Hitesh-Scorpio/apex-malhar 
APEXMALHAR-2384_Documentation_Fixed_Width_Parser

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/533.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #533






> Add documentation for FixedWIdthParser
> --
>
> Key: APEXMALHAR-2384
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2384
> Project: Apache Apex Malhar
>  Issue Type: Documentation
>Reporter: Hitesh Kapoor
>Assignee: Hitesh Kapoor
>






Schema Discovery Support in Apex Applications

2017-01-16 Thread Chinmay Kolhatkar
Hi All,

Currently, if a DAG generated by a user contains any POJO-based ("POJOfied")
operators, the TUPLE_CLASS attribute needs to be set on each and every port
which receives or sends a POJO.

For example, if a DAG is like File -> Parser -> Transform -> Dedup ->
Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by the
user on both the input and output ports of the transform and dedup
operators, and also on the parser output and formatter input.

The proposal here is to reduce the work required by the user to configure
the DAG. Technically speaking, if an operator knows its input schema and its
processing properties, it can determine its output schema and convey it to
downstream operators. This way the complete pipeline can be configured
without the user setting TUPLE_CLASS or even creating POJOs and adding them
to the classpath.

Here is a document which explains the idea and a high-level design:
https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing

I would like to get the community's opinion about the feasibility and
applications of this proposal.
Once we have some consensus, we can discuss the design in detail.

Thanks,
Chinmay.
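
To illustrate the propagation idea only, here is a minimal sketch with hypothetical types (Schema, SchemaAware), not the design in the linked document: each operator derives its output schema from its input schema plus its own properties and hands it downstream.

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// A schema here is just an ordered field-name -> type map.
class Schema {
  final Map<String, Class<?>> fields = new LinkedHashMap<>();
  Schema addField(String name, Class<?> type) { fields.put(name, type); return this; }
}

// An operator that can report what it emits, given what it receives.
interface SchemaAware {
  Schema deriveOutputSchema(Schema input);
}

// A transform that renames one field: its output schema follows mechanically
// from its input schema and its own properties, with no TUPLE_CLASS needed.
class RenameFieldTransform implements SchemaAware {
  private final String from, to;
  RenameFieldTransform(String from, String to) { this.from = from; this.to = to; }

  @Override
  public Schema deriveOutputSchema(Schema input) {
    Schema out = new Schema();
    input.fields.forEach((name, type) -> out.addField(name.equals(from) ? to : name, type));
    return out;
  }
}

class SchemaPropagationDemo {
  public static void main(String[] args) {
    Schema parserOutput = new Schema().addField("id", Long.class).addField("amt", Double.class);
    Schema transformOutput = new RenameFieldTransform("amt", "amount").deriveOutputSchema(parserOutput);
    System.out.println(transformOutput.fields); // {id=class java.lang.Long, amount=class java.lang.Double}
  }
}
{code}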


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

2017-01-16 Thread Bhupesh Chawda
Yes, if the user needs to develop a batch application, then batch aware
operators need to be used in the application.
The nature of the application is mostly controlled by the input and the
output operators used in the application.

For example, consider an application which needs to filter records in an
input file and store the filtered records in another file. The nature of
this app is to end once the entire file is processed. The following things are
expected of the application:

   1. Once the input data is over, finalize the output file from .tmp
   files. - Responsibility of output operator
   2. End the application, once the data is read and processed -
   Responsibility of input operator

These functions are essential to allow the user to do higher level
operations like scheduling or running a workflow of batch applications.

I am not sure about intermediate (processing) operators, as there is no
change in their functionality for batch use cases. Perhaps, allowing
running multiple batches in a single application may require similar
changes in processing operators as well.

~ Bhupesh
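
As an illustration of what "batch aware" could mean for an output operator, here is a minimal sketch assuming a hypothetical end-batch control tuple hook (custom control tuples were still under design at the time); this is not the AbstractFileOutputOperator code.

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

class BatchAwareFileOutput {
  private final Path tmpFile;
  private final Path finalFile;

  BatchAwareFileOutput(String dir, String name) {
    this.tmpFile = Paths.get(dir, name + ".tmp");
    this.finalFile = Paths.get(dir, name);
  }

  // Normal tuple processing keeps appending to the .tmp part file.
  void processTuple(String line) throws IOException {
    Files.write(tmpFile, (line + System.lineSeparator()).getBytes(),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  // Hypothetical hook invoked when an end-batch control tuple arrives:
  // finalize the output by promoting the .tmp file to its final name.
  void onEndBatch() throws IOException {
    Files.move(tmpFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}

An input operator would analogously react to end-of-data by emitting the end-batch tuple and shutting the DAG down, which is what enables scheduling batch applications in a workflow.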

On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale  wrote:

> Will it make an impression on user that, if he has a batch usecase he has
> to use batch aware operators only? If so, is that what we expect? I am not
> aware of how do we implement batch scenario so this might be a basic
> question.
>
> -Priyanka
>
> On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda 
> wrote:
>
> > Hi All,
> >
> > While design / implementation for custom control tuples is ongoing, I
> > thought it would be a good idea to consider its usefulness in one of the
> > use cases -  batch applications.
> >
> > This is a proposal to adapt / extend existing operators in the Apache
> Apex
> > Malhar library so that it is easy to use them in batch use cases.
> > Naturally, this would be applicable for only a subset of operators like
> > File, JDBC and NoSQL databases.
> > For example, for a file based store, (say HDFS store), we could have
> > FileBatchInput and FileBatchOutput operators which allow easy integration
> > into a batch application. These operators would be extended from their
> > existing implementations and would be "Batch Aware", in that they may
> > understand the meaning of some specific control tuples that flow through
> > the DAG. Start batch and end batch seem to be the obvious candidates that
> > come to mind. On receipt of such control tuples, they may try to modify
> the
> > behavior of the operator - to reinitialize some metrics or finalize an
> > output file for example.
> >
> > We can discuss the potential control tuples and actions in detail, but
> > first I would like to understand the views of the community for this
> > proposal.
> >
> > ~ Bhupesh
> >
>


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

2017-01-16 Thread Priyanka Gugale
Will it give the user the impression that, if he has a batch use case, he has
to use batch-aware operators only? If so, is that what we expect? I am not
aware of how we implement the batch scenario, so this might be a basic
question.

-Priyanka

On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda 
wrote:

> Hi All,
>
> While design / implementation for custom control tuples is ongoing, I
> thought it would be a good idea to consider its usefulness in one of the
> use cases -  batch applications.
>
> This is a proposal to adapt / extend existing operators in the Apache Apex
> Malhar library so that it is easy to use them in batch use cases.
> Naturally, this would be applicable for only a subset of operators like
> File, JDBC and NoSQL databases.
> For example, for a file based store, (say HDFS store), we could have
> FileBatchInput and FileBatchOutput operators which allow easy integration
> into a batch application. These operators would be extended from their
> existing implementations and would be "Batch Aware", in that they may
> understand the meaning of some specific control tuples that flow through
> the DAG. Start batch and end batch seem to be the obvious candidates that
> come to mind. On receipt of such control tuples, they may try to modify the
> behavior of the operator - to reinitialize some metrics or finalize an
> output file for example.
>
> We can discuss the potential control tuples and actions in detail, but
> first I would like to understand the views of the community for this
> proposal.
>
> ~ Bhupesh
>