[jira] [Commented] (APEXCORE-596) Committed method on operators not called when stream locality is THREAD_LOCAL
[ https://issues.apache.org/jira/browse/APEXCORE-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825574#comment-15825574 ] ASF GitHub Bot commented on APEXCORE-596: - Github user francisf closed the pull request at: https://github.com/apache/apex-core/pull/442 > Committed method on operators not called when stream locality is THREAD_LOCAL > - > > Key: APEXCORE-596 > URL: https://issues.apache.org/jira/browse/APEXCORE-596 > Project: Apache Apex Core > Issue Type: Bug >Affects Versions: 3.5.0 >Reporter: Francis Fernandes >Assignee: Francis Fernandes > > When the locality of the stream connecting the two operators is > Locality.THREAD_LOCAL, the committed method is not called for some operators. > These operators implement the Operator.CheckpointListener. e.g. > AbstractFileOutputOperator > For thread local during activate we do not set the thread in the node's > context > Because the thread is not set, we skip this operator in the > processHeartBeatResponse and the committed is not called > {code} > if (thread == null || !thread.isAlive()) { > continue; > } > {code} > We need this condition for invalid operators (operator failures) in case of > other localities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
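The guard quoted in the issue can be reproduced in miniature with plain java.lang.Thread. The sketch below uses hypothetical names (NodeContext, deliverCommitted, assignThreadToGroup are illustrative stand-ins, not apex-core classes) to show why a node whose thread was never recorded during activate gets skipped in the heartbeat path, and why recording the thread on every node of the OIO group, as the PR does, fixes it:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (hypothetical names, not the real apex-core classes) of why
// processHeartbeatResponse skips OIO nodes: committed() is only delivered to
// nodes whose executing thread has been recorded and is alive.
public class OioCommittedSketch {

  static class NodeContext {
    final String name;
    Thread thread;                 // stays null unless activate() recorded it
    boolean committedCalled;
    NodeContext(String name) { this.name = name; }
  }

  // Mirrors the guard quoted in the JIRA: nodes without a live thread are skipped.
  static void deliverCommitted(List<NodeContext> nodes) {
    for (NodeContext node : nodes) {
      if (node.thread == null || !node.thread.isAlive()) {
        continue;                  // invalid/failed operator -> no callback
      }
      node.committedCalled = true; // would invoke CheckpointListener.committed()
    }
  }

  // The fix, as described in the PR title: set the same container thread on
  // every node of the OIO (THREAD_LOCAL) group, not just one of them.
  static void assignThreadToGroup(List<NodeContext> oioGroup, Thread t) {
    for (NodeContext node : oioGroup) {
      node.thread = t;
    }
  }

  public static void main(String[] args) throws Exception {
    NodeContext leader = new NodeContext("leader");
    NodeContext follower = new NodeContext("follower");
    List<NodeContext> group = new ArrayList<>();
    group.add(leader);
    group.add(follower);

    Thread containerThread = new Thread(() -> {
      try { Thread.sleep(200); } catch (InterruptedException ignored) { }
    });
    containerThread.start();

    // Before the fix: only one node had its thread set.
    leader.thread = containerThread;
    deliverCommitted(group);
    System.out.println("before fix, follower committed: " + follower.committedCalled);

    // After the fix: every node in the OIO group shares the thread.
    assignThreadToGroup(group, containerThread);
    deliverCommitted(group);
    System.out.println("after fix, follower committed: " + follower.committedCalled);
    containerThread.join();
  }
}
```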
[jira] [Commented] (APEXCORE-596) Committed method on operators not called when stream locality is THREAD_LOCAL
[ https://issues.apache.org/jira/browse/APEXCORE-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825575#comment-15825575 ] ASF GitHub Bot commented on APEXCORE-596: - GitHub user francisf reopened a pull request: https://github.com/apache/apex-core/pull/442 APEXCORE-596 Setting the thread for all oio nodes in the oio group @tushargosavi please review You can merge this pull request into a Git repository by running: $ git pull https://github.com/francisf/apex-core APEXCORE-596_OioThreadSet Alternatively you can review and apply these changes as the patch at: https://github.com/apache/apex-core/pull/442.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #442 commit 3debbb998cfbb1866df32fe77e959e28412916a0 Author: francisfDate: 2017-01-05T07:32:05Z APEXCORE-596 Setting the thread for all oio nodes in the oio group, refactoring tests > Committed method on operators not called when stream locality is THREAD_LOCAL > - > > Key: APEXCORE-596 > URL: https://issues.apache.org/jira/browse/APEXCORE-596 > Project: Apache Apex Core > Issue Type: Bug >Affects Versions: 3.5.0 >Reporter: Francis Fernandes >Assignee: Francis Fernandes > > When the locality of the stream connecting the two operators is > Locality.THREAD_LOCAL, the committed method is not called for some operators. > These operators implement the Operator.CheckpointListener. e.g. > AbstractFileOutputOperator > For thread local during activate we do not set the thread in the node's > context > Because the thread is not set, we skip this operator in the > processHeartBeatResponse and the committed is not called > {code} > if (thread == null || !thread.isAlive()) { > continue; > } > {code} > We need this condition for invalid operators (operator failures) in case of > other localities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] apex-core pull request #442: APEXCORE-596 Setting the thread for all oio nod...
GitHub user francisf reopened a pull request: https://github.com/apache/apex-core/pull/442

APEXCORE-596 Setting the thread for all oio nodes in the oio group

@tushargosavi please review

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/francisf/apex-core APEXCORE-596_OioThreadSet

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/apex-core/pull/442.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #442

commit 3debbb998cfbb1866df32fe77e959e28412916a0
Author: francisf
Date: 2017-01-05T07:32:05Z

    APEXCORE-596 Setting the thread for all oio nodes in the oio group, refactoring tests

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] apex-core pull request #442: APEXCORE-596 Setting the thread for all oio nod...
Github user francisf closed the pull request at: https://github.com/apache/apex-core/pull/442
Re: Contribution Process before PR
I agree that the discussions are important. Discussions do currently happen over the mailing list, but some topics get missed and we should improve there. Tasks should be brought to the community before the design starts, and the design should then mature through community discussion. That said, there are a few observations I would like to point out: 1. If I understand correctly, any development should still be time-bound. This includes the discussions over the mailing list or on the PR as well. If design discussions are happening on the PR, it means the mailing list communication is failing somewhere. 2. Moreover, discussions should be conclusive. Some discussions keep going back and forth and take a long time to reach a conclusion. IMO this blocks development and does not benefit anyone. 3. There are mailing threads (few, but important ones) with as many as 30+ email exchanges. After a point, I believe the crux of the topic gets lost. If an email thread goes beyond 10 mails without reaching a conclusion, then something is wrong with the way the discussion is heading, and it is the responsibility of everyone, not just the original author, to "hold the thought" and re-align the discussion towards a conclusion and consensus. 4. At the same time, there are cases where PR comments take the shape of a design discussion because the original thread was not followed completely. Everyone has enough to catch up on, so one might miss following a certain mail thread but still raise crucial points in a PR comment. In such cases, an alternate channel (maybe an offline call with the author to clarify the idea) might help. This is just to make sure that progress is made on the current task; it should not become standard practice. 5. As a community we should emphasize producing quality code, but we should also be mindful of the time it takes.
It's hard to get the right balance, hence it might be good practice to break things down into phases instead of trying to do everything in a single go. In that case, it should be made clear to the community what the first phase is, the second phase, and so on. On Tue, Jan 17, 2017 at 12:04 PM, Bhupesh Chawda wrote: > I agree. > The discussion on requirements very much helps folks to understand the need > for a feature and sometimes much more about related topics. From my own > experience, the discussion thread on requirements before the implementation > is what helps the design discussions and allows even people not aware of > the internals, to quickly pickup things with little effort. > > At the same time, I do feel the need for a prototype implementation to > clarify things, most of the times. But, I think this should be > supplementary to the discussions and not a replacement. > > ~ Bhupesh > > On Tue, Jan 17, 2017 at 7:50 AM, Sandesh Hegde > wrote: > > > I do see value in having the review PR along with the discussion in > > the mailing list/jira. > > > > It shows the commitment of the Author to the idea and to the project. > > There is some truth in what Linus Torvalds says > > https://lkml.org/lkml/2000/8/25/132 > > > > Thanks > > > > > > On Mon, Jan 16, 2017 at 8:11 AM Amol Kekre wrote: > > > I do see folks discussing issues for most part now. For me this does > not > > > look like an issue that is hurting Apex community. I do however want to > > > discuss cases where code gets blocked and get spilled into larger > > context. > > > As a culture and a process, we have a very high overhead that deters > > > contributions.
> > > > > > Thks > > > Amol > > > > > > > > > On Sun, Jan 15, 2017 at 8:50 PM, Pramod Immaneni < > pra...@datatorrent.com > > > > > > wrote: > > > > > > > Yes, it will be good to have these points added to the contributor > > > > guidelines but I also see for the most part folks do bring up issues > > for > > > > discussion, try to address concerns and come to a consensus and in > > > > generally participate in the community. I also think we should have > > some > > > > latitude in the process when it comes to bug fixes that are contained > > and > > > > don't spill into re-design of components otherwise, the overhead will > > > deter > > > > contributions especially from folks who are new to the project and > want > > > to > > > > start contributing by fixing low hanging bugs. > > > > > > > > Thanks > > > > > > > > On Sun, Jan 15, 2017 at 7:50 PM, Thomas Weise > wrote: > > > > > > > > > Hi, > > > > > > > > > > I want to propose additions to the contributor guidelines that > place > > > > > stronger emphasis on open collaboration and the early part of the > > > > > contribution process. > > > > > > > > > > Specifically, I would like to suggest that *thought process* and > > > *design > > > > > discussion* are more important than the
Re: Schema Discovery Support in Apex Applications
+1 for the feature. ~ Bhupesh On Mon, Jan 16, 2017 at 5:09 PM, Chinmay Kolhatkar wrote: > Those are not really anonymous POJOs... The definition of the POJO will be > known to the user, since based on it the upstream operator will convey the tuple > type it will be emitting. > Using that information the user can configure the operators. Those properties > will be a bit different though. > > On Mon, Jan 16, 2017 at 4:20 PM, AJAY GUPTA wrote: > > > +1 for the idea. > > > > I just had one question. > > > > As I understand, there will be some form of anonymous POJO used as > objects > > to pass information from one operator to another. Can you share how the > > user/operator developer would access the tuple object in case he wishes > to > > do something with it? > > > > > > Ajay > > > > On Mon, Jan 16, 2017 at 2:53 PM, Chinmay Kolhatkar > > wrote: > > > > > Hi All, > > > > > > Currently, if a DAG generated by a user contains any POJOfied > > > operators, the TUPLE_CLASS attribute needs to be set on each and every port > > > which receives or sends a POJO. > > > > > > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup -> > > > Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by the user on > > > both input and output ports of the transform and dedup operators, and also on > > > the parser output and formatter input. > > > > > > The proposal here is to reduce the work required from the user to > configure > > > the DAG. Technically speaking, if an operator knows its input schema and > > > processing properties, it can determine its output schema and convey it to > > > downstream operators. This way the complete pipeline can be configured > > > without the user setting TUPLE_CLASS or even creating POJOs and adding them > > to > > > the classpath.
> > > Here is the document which at a high level explains the idea and a high > > > level design: > > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_ > > > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing > > > > > > I would like to get opinion from community about feasibility and > > > applications of this proposal. > > > Once we get some consensus we can discuss the design in details. > > > > > > Thanks, > > > Chinmay. > > > > > >
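The propagation idea from the proposal above can be sketched in a few lines: if each operator can derive its output schema from its input schema plus its own configuration, a linear pipeline can be typed end to end without the user ever setting TUPLE_CLASS. All names here (SchemaAware, TransformSketch, propagate) are illustrative assumptions, not Malhar APIs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of schema propagation: an operator that knows its input
// schema and its processing properties derives and publishes its output
// schema for the next operator. Not a Malhar API; names are illustrative.
public class SchemaPropagationSketch {

  interface SchemaAware {
    // Derive the output field->type map from the input schema.
    Map<String, Class<?>> deriveOutputSchema(Map<String, Class<?>> input);
  }

  // A transform that renames fields according to its configuration.
  static class TransformSketch implements SchemaAware {
    final Map<String, String> fieldRenames;
    TransformSketch(Map<String, String> fieldRenames) { this.fieldRenames = fieldRenames; }

    @Override
    public Map<String, Class<?>> deriveOutputSchema(Map<String, Class<?>> input) {
      Map<String, Class<?>> out = new LinkedHashMap<>();
      for (Map.Entry<String, Class<?>> e : input.entrySet()) {
        out.put(fieldRenames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
      }
      return out;
    }
  }

  // Walk a linear pipeline, conveying each derived schema downstream.
  static Map<String, Class<?>> propagate(Map<String, Class<?>> source, SchemaAware... ops) {
    Map<String, Class<?>> schema = source;
    for (SchemaAware op : ops) {
      schema = op.deriveOutputSchema(schema);
    }
    return schema;
  }

  public static void main(String[] args) {
    Map<String, Class<?>> parserOutput = new LinkedHashMap<>();
    parserOutput.put("name", String.class);
    parserOutput.put("amount", Long.class);

    Map<String, String> renames = new LinkedHashMap<>();
    renames.put("amount", "total");

    // The transform's output schema is computed, not configured by the user.
    System.out.println(propagate(parserOutput, new TransformSketch(renames)));
  }
}
```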
Re: Contribution Process before PR
I agree. The discussion on requirements very much helps folks to understand the need for a feature and sometimes much more about related topics. From my own experience, the discussion thread on requirements before the implementation is what helps the design discussions and allows even people not aware of the internals, to quickly pickup things with little effort. At the same time, I do feel the need for a prototype implementation to clarify things, most of the times. But, I think this should be supplementary to the discussions and not a replacement. ~ Bhupesh On Tue, Jan 17, 2017 at 7:50 AM, Sandesh Hegdewrote: > I do see value in having the review PR along with the discussion in > the mailing list/jira. > > It shows the commitment of the Author to the idea and to the project. > There is some truth in what Linus Torvalds says > https://lkml.org/lkml/2000/8/25/132 > > Thanks > > > On Mon, Jan 16, 2017 at 8:11 AM Amol Kekre wrote: > > > I do see folks discussing issues for most part now. For me this does not > > look like an issue that is hurting Apex community. I do however want to > > discuss cases where code gets blocked and get spilled into larger > context. > > As a culture and a process, we have a very high overhead that deters > > contributions. > > > > Thks > > Amol > > > > > > On Sun, Jan 15, 2017 at 8:50 PM, Pramod Immaneni > > > wrote: > > > > > Yes, it will be good to have these points added to the contributor > > > guidelines but I also see for the most part folks do bring up issues > for > > > discussion, try to address concerns and come to a consensus and in > > > generally participate in the community. I also think we should have > some > > > latitude in the process when it comes to bug fixes that are contained > and > > > don't spill into re-design of components otherwise, the overhead will > > deter > > > contributions especially from folks who are new to the project and want > > to > > > start contributing by fixing low hanging bugs. 
> > > > > > Thanks > > > > > > On Sun, Jan 15, 2017 at 7:50 PM, Thomas Weise wrote: > > > > > > > Hi, > > > > > > > > I want to propose additions to the contributor guidelines that place > > > > stronger emphasis on open collaboration and the early part of the > > > > contribution process. > > > > > > > > Specifically, I would like to suggest that *thought process* and > > *design > > > > discussion* are more important than the final code produced. It is > > > > necessary to develop the community and invest in the future of the > > > project. > > > > > > > > I start this discussion based on observation over time. I have seen > > cases > > > > (non trivial changes) where code and JIRAs appear at the same time, > > where > > > > the big picture is discussed after the PR is already open, or where > > > > information that would be valuable to other contributors or users > isn't > > > on > > > > record. > > > > > > > > Let's consider a non-trivial change or a feature. It would normally > > start > > > > with engagement on the mailing list to ensure time is well spent and > > the > > > > proposal is welcomed by the community, does not conflict with other > > > > initiatives etc. > > > > > > > > Once that is cleared, we would want to think about design, the how in > > the > > > > larger picture. In many cases that would involve discussion, > questions, > > > > suggestions, consensus building towards agreed approach. Or maybe it > is > > > > done through prototyping. In any case, before a PR is raised, it will > > be > > > > good to have as prerequisite that *thought process and approach have > > been > > > > documented*. I would prefer to see that on the JIRA, what do others > > > think? > > > > > > > > Benefits: > > > > > > > > * Contributor does not waste time and there is no frustration due to > a > > PR > > > > being turned down for reasons that could be avoided with upfront > > > > communication. 
> > > > > > > > * Contributor benefits from suggestions, questions, guidance of those > > > with > > > > in depth knowledge of particular areas. > > > > > > > > * Other community members have an opportunity to learn from > discussion, > > > the > > > > knowledge base broadens. > > > > > > > > * Information gets indexed, user later looking at JIRAs will find > > > valuable > > > > information on how certain problems were solved that they would never > > > > obtain from a PR. > > > > > > > > The ASF and "Apache Way", a read for the bigger picture with more > links > > > in > > > > it: > > > > http://krzysztof-sobkowiak.net/blog/celebrating-17-years- > > > > of-the-apache-software-foundation/ > > > > > > > > Looking forward to feedback and discussion, > > > > Thomas > > > > > > > > > >
Re: Extending DAG API for accessing objects in DAG.
I have opened https://issues.apache.org/jira/browse/APEXCORE-604 for this. If there is no objection I will open a pull request for it. - Tushar. On Mon, Jan 9, 2017 at 4:44 PM, Tushar Gosavi wrote: > Hi All, > > I am planning to extend the DAG API to export internals of the DAG. > The current DAG does not provide a way to get the list of operators and > streams with their attributes. Also StreamMeta does not provide an > API to access its end ports. > > This type of information is needed when external translators (like > Samoa) are constructing the DAG, > and we need to apply some transformation or configure the DAG before > it is run. Planning to extend the > DAG API with the following: > > InputPortMeta > public Operator.InputPort getPortObject(); > OutputPortMeta > public Operator.OutputPort getPortObject(); > StreamMeta > public T getSource(); > public Collection getSinks(); > DAG > public abstract Collection getOperators(); > public abstract Collection getStreams(); > > Regards, > - Tushar.
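A caller such as an external translator could use accessors of this shape roughly as below. The interfaces only mirror the signatures listed in the proposal; they are illustrative stand-ins, not the apex-core types, and the toy DAG exists purely to exercise them:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

// Sketch of how a translator might consume the proposed accessors.
// Stand-in interfaces, not apex-core types.
public class DagIntrospectionSketch {

  interface OperatorMeta { String getName(); }

  interface StreamMeta {
    OperatorMeta getSource();             // proposed accessor for the upstream end
    Collection<OperatorMeta> getSinks();  // proposed accessor for the downstream ends
  }

  interface Dag {
    Collection<OperatorMeta> getOperators(); // proposed
    Collection<StreamMeta> getStreams();     // proposed
  }

  // With such accessors a translator can inspect the plan before launch,
  // e.g. count the total fan-out across all streams:
  static int totalSinkCount(Dag dag) {
    int n = 0;
    for (StreamMeta s : dag.getStreams()) {
      n += s.getSinks().size();
    }
    return n;
  }

  // Toy in-memory DAG: reader -> (writerA, writerB), just enough to exercise
  // the accessors above.
  static Dag toyDag() {
    final OperatorMeta reader = () -> "reader";
    final OperatorMeta writerA = () -> "writerA";
    final OperatorMeta writerB = () -> "writerB";
    final StreamMeta stream = new StreamMeta() {
      public OperatorMeta getSource() { return reader; }
      public Collection<OperatorMeta> getSinks() { return Arrays.asList(writerA, writerB); }
    };
    return new Dag() {
      public Collection<OperatorMeta> getOperators() { return Arrays.asList(reader, writerA, writerB); }
      public Collection<StreamMeta> getStreams() { return new ArrayList<>(Arrays.asList(stream)); }
    };
  }

  public static void main(String[] args) {
    Dag dag = toyDag();
    System.out.println("operators: " + dag.getOperators().size());
    System.out.println("total sinks: " + totalSinkCount(dag));
  }
}
```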
Re: Sharing jars among different Apex apps on cluster
Yes Amol, the PR that Thomas pointed out is for that purpose. It allows changing the plan at runtime. We can remove all operators entirely and add new operators to the existing DAG. ~ Bhupesh On Jan 16, 2017 23:38, "Amol Kekre" wrote: > Bhupesh, > If stram changes the logical dag (runs only the sub-dag) on each run, it > should be ok. > > Thks > Amol > > > On Mon, Jan 16, 2017 at 9:18 AM, Bhupesh Chawda > wrote: > > > Yes, I thought of that. That way it will be a single application > > throughout. > > Just wanted to see if there can be other options. > > > > ~ Bhupesh > > > > On Mon, Jan 16, 2017 at 9:17 PM, Thomas Weise wrote: > > > > > Sounds like a fit for https://github.com/apache/apex-core/pull/410 ? > > > > > > On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda < bhup...@datatorrent.com > > > > > > wrote: > > > > > > > Hi All, > > > > > > > > We have a use case where I need to launch a number of DAGs on the > > cluster > > > > one after the other in sequence programmatically. > > > > > > > > We are using the StramAppLauncher and StramAppFactory classes to > > launch a > > > > DAG programmatically on the cluster and adding any third party > > > dependencies > > > > as part of the configuration. > > > > > > > > It is working fine except for the following issue: > > > > Every time a DAG is launched, it copies the dependencies to the > > > application > > > > folder and hence spends a good amount of time before the app actually > > > > starts running. All of the apps I run belong to the same project and > > > hence > > > > don't actually need separate sets of jars. > > > > > > > > Is there any way I can make all the applications "share" the jars > which > > > are > > > > uploaded when the first application is run? > > > > > > > > Thanks. > > > > > > > > ~ Bhupesh > > > > > > > > > >
Re: Suggestion on optimise kryo Output
Kryo is used in a default implementation of the StreamCodec interface. Ideally, if the StreamCodec interface itself allowed the buffer to be passed, then we could also send the buffer from the BufferServer in the future. On Mon, Jan 9, 2017 at 4:10 PM Bright Chen wrote: > Hi, > > The Kryo Output has some limitations: > > - The size of the data is limited. Kryo writes data to a fixed buffer; it > will throw an overflow exception if the data exceeds the buffer size. > - Output.toBytes() copies the data to a temporary buffer before > output, which decreases performance and introduces garbage > collection. > > While tuning the Spillable data structures and Managed State, I created a > mechanism to share and reuse memory to avoid the above problems. It can > be reused in core serialization with a small change. Please see the JIRA: > https://issues.apache.org/jira/browse/APEXMALHAR-2190 > > > For any suggestions or comments, please put them in the JIRA: > https://issues.apache.org/jira/browse/APEXCORE-606 > > > thanks > > Bright >
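The two limitations described in the thread can be illustrated without Kryo itself. The class below is a stand-in, not Kryo code: it mirrors the behavior of a fixed-capacity Output, which throws on overflow, and of a toBytes() that must copy into a fresh array on every call, producing garbage:

```java
import java.util.Arrays;

// Illustrative stand-in for a fixed-capacity serializer output buffer.
// Mirrors the two limitations from the discussion; not actual Kryo code.
public class FixedBufferSketch {
  private final byte[] buffer;
  private int position;

  FixedBufferSketch(int capacity) { buffer = new byte[capacity]; }

  void writeByte(byte b) {
    if (position >= buffer.length) {
      // Limitation 1: a fixed buffer overflows once data exceeds its size.
      throw new IllegalStateException("buffer overflow at position " + position);
    }
    buffer[position++] = b;
  }

  byte[] toBytes() {
    // Limitation 2: every call copies into a new temporary array,
    // costing an allocation and extra garbage per serialized tuple.
    return Arrays.copyOf(buffer, position);
  }

  public static void main(String[] args) {
    FixedBufferSketch out = new FixedBufferSketch(4);
    for (int i = 0; i < 4; i++) {
      out.writeByte((byte) i);
    }
    System.out.println("bytes written: " + out.toBytes().length);
    try {
      out.writeByte((byte) 99);
    } catch (IllegalStateException e) {
      System.out.println("overflow: " + e.getMessage());
    }
  }
}
```

A sharing/reuse mechanism like the one proposed in APEXMALHAR-2190 would avoid both: grow or chain buffers instead of throwing, and expose the internal buffer instead of copying.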
[jira] [Commented] (APEXCORE-570) Prevent upstream operators from getting too far ahead when downstream operators are slow
[ https://issues.apache.org/jira/browse/APEXCORE-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824628#comment-15824628 ] Pramod Immaneni commented on APEXCORE-570: -- I see your point that you could disable back pressure on the bursty one and get a better throughput. > Prevent upstream operators from getting too far ahead when downstream > operators are slow > > > Key: APEXCORE-570 > URL: https://issues.apache.org/jira/browse/APEXCORE-570 > Project: Apache Apex Core > Issue Type: Improvement >Reporter: Pramod Immaneni >Assignee: Pramod Immaneni > > If the downstream operators are slower than upstream operators then the > upstream operators will get ahead and the gap can continue to increase. > Provide an option to slow down or temporarily pause the upstream operators > when they get too far ahead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (APEXCORE-570) Prevent upstream operators from getting too far ahead when downstream operators are slow
[ https://issues.apache.org/jira/browse/APEXCORE-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824455#comment-15824455 ] Pramod Immaneni commented on APEXCORE-570: -- "Consider the case of two subscribers with same throughput but different pattern. Once with constant and capped rate, other with bursts of activity. If they are forced to read at the same rate, then both will be slower and overall throughput reduce? Granted this is not the common scenario, but something to consider." Yes it would result in a slower throughput but since the operator is controlled by a single thread I think back pressure over one stream will block the other. > Prevent upstream operators from getting too far ahead when downstream > operators are slow > > > Key: APEXCORE-570 > URL: https://issues.apache.org/jira/browse/APEXCORE-570 > Project: Apache Apex Core > Issue Type: Improvement >Reporter: Pramod Immaneni >Assignee: Pramod Immaneni > > If the downstream operators are slower than upstream operators then the > upstream operators will get ahead and the gap can continue to increase. > Provide an option to slow down or temporarily pause the upstream operators > when they get too far ahead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
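The "slow down or temporarily pause the upstream operator" option proposed in this issue can be sketched with a bounded queue: the capacity is the maximum lead the producer may build up, and put() blocks once the consumer falls that far behind. This is illustrative only; Apex streams flow through the buffer server, not a java.util.concurrent queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Backpressure sketch: a bounded queue caps how far ahead the upstream
// (producer) may get of the downstream (consumer). Illustrative only.
public class BackpressureSketch {

  // Runs a producer/consumer pair and returns how many tuples arrived.
  static int run(int tuples, int maxLead) {
    BlockingQueue<Integer> stream = new ArrayBlockingQueue<>(maxLead);
    AtomicInteger consumed = new AtomicInteger();

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < tuples; i++) {
          stream.put(i); // blocks while the consumer is maxLead tuples behind
        }
      } catch (InterruptedException ignored) { }
    });

    Thread consumer = new Thread(() -> {
      try {
        for (int i = 0; i < tuples; i++) {
          stream.take();
          consumed.incrementAndGet();
        }
      } catch (InterruptedException ignored) { }
    });

    producer.start();
    consumer.start();
    try {
      producer.join();
      consumer.join();
    } catch (InterruptedException ignored) { }
    return consumed.get();
  }

  public static void main(String[] args) {
    // Nothing is lost, and the producer never runs more than 8 tuples ahead.
    System.out.println("consumed: " + run(1000, 8));
  }
}
```

The trade-off raised in the comments shows up here too: with one thread serving two such streams, a block on the capped stream also stalls the bursty one, so per-stream tuning (or disabling) of the cap matters for throughput.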
Re: Sharing jars among different Apex apps on cluster
Bhupesh, If stram changes the logical dag (runs only the sub-dag) on each run, it should be ok. Thks Amol On Mon, Jan 16, 2017 at 9:18 AM, Bhupesh Chawda wrote: > Yes, I thought of that. That way it will be a single application > throughout. > Just wanted to see if there can be other options. > > ~ Bhupesh > > On Mon, Jan 16, 2017 at 9:17 PM, Thomas Weise wrote: > > > Sounds like a fit for https://github.com/apache/apex-core/pull/410 ? > > > > On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda > > > wrote: > > > > > Hi All, > > > > > > We have a use case where I need to launch a number of DAGs on the > cluster > > > one after the other in sequence programmatically. > > > > > > We are using the StramAppLauncher and StramAppFactory classes to > launch a > > > DAG programmatically on the cluster and adding any third party > > dependencies > > > as part of the configuration. > > > > > > It is working fine except for the following issue: > > > Every time a DAG is launched, it copies the dependencies to the > > application > > > folder and hence spends a good amount of time before the app actually > > > starts running. All of the apps I run belong to the same project and > > hence > > > don't actually need separate sets of jars. > > > > > > Is there any way I can make all the applications "share" the jars which > > are > > > uploaded when the first application is run? > > > > > > Thanks. > > > > > > ~ Bhupesh > > > > > >
Re: Sharing jars among different Apex apps on cluster
Yes, I thought of that. That way it will be a single application throughout. Just wanted to see if there can be other options. ~ Bhupesh On Mon, Jan 16, 2017 at 9:17 PM, Thomas Weise wrote: > Sounds like a fit for https://github.com/apache/apex-core/pull/410 ? > > On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda > wrote: > > > Hi All, > > > > We have a use case where I need to launch a number of DAGs on the cluster > > one after the other in sequence programmatically. > > > > We are using the StramAppLauncher and StramAppFactory classes to launch a > > DAG programmatically on the cluster and adding any third party > dependencies > > as part of the configuration. > > > > It is working fine except for the following issue: > > Every time a DAG is launched, it copies the dependencies to the > application > > folder and hence spends a good amount of time before the app actually > > starts running. All of the apps I run belong to the same project and > hence > > don't actually need separate sets of jars. > > > > Is there any way I can make all the applications "share" the jars which > are > > uploaded when the first application is run? > > > > Thanks. > > > > ~ Bhupesh > > >
Re: Contribution Process before PR
I do see folks discussing issues for most part now. For me this does not look like an issue that is hurting Apex community. I do however want to discuss cases where code gets blocked and get spilled into larger context. As a culture and a process, we have a very high overhead that deters contributions. Thks Amol On Sun, Jan 15, 2017 at 8:50 PM, Pramod Immaneniwrote: > Yes, it will be good to have these points added to the contributor > guidelines but I also see for the most part folks do bring up issues for > discussion, try to address concerns and come to a consensus and in > generally participate in the community. I also think we should have some > latitude in the process when it comes to bug fixes that are contained and > don't spill into re-design of components otherwise, the overhead will deter > contributions especially from folks who are new to the project and want to > start contributing by fixing low hanging bugs. > > Thanks > > On Sun, Jan 15, 2017 at 7:50 PM, Thomas Weise wrote: > > > Hi, > > > > I want to propose additions to the contributor guidelines that place > > stronger emphasis on open collaboration and the early part of the > > contribution process. > > > > Specifically, I would like to suggest that *thought process* and *design > > discussion* are more important than the final code produced. It is > > necessary to develop the community and invest in the future of the > project. > > > > I start this discussion based on observation over time. I have seen cases > > (non trivial changes) where code and JIRAs appear at the same time, where > > the big picture is discussed after the PR is already open, or where > > information that would be valuable to other contributors or users isn't > on > > record. > > > > Let's consider a non-trivial change or a feature. It would normally start > > with engagement on the mailing list to ensure time is well spent and the > > proposal is welcomed by the community, does not conflict with other > > initiatives etc. 
> > > > Once that is cleared, we would want to think about design, the how in the > > larger picture. In many cases that would involve discussion, questions, > > suggestions, consensus building towards agreed approach. Or maybe it is > > done through prototyping. In any case, before a PR is raised, it will be > > good to have as prerequisite that *thought process and approach have been > > documented*. I would prefer to see that on the JIRA, what do others > think? > > > > Benefits: > > > > * Contributor does not waste time and there is no frustration due to a PR > > being turned down for reasons that could be avoided with upfront > > communication. > > > > * Contributor benefits from suggestions, questions, guidance of those > with > > in depth knowledge of particular areas. > > > > * Other community members have an opportunity to learn from discussion, > the > > knowledge base broadens. > > > > * Information gets indexed, user later looking at JIRAs will find > valuable > > information on how certain problems were solved that they would never > > obtain from a PR. > > > > The ASF and "Apache Way", a read for the bigger picture with more links > in > > it: > > http://krzysztof-sobkowiak.net/blog/celebrating-17-years- > > of-the-apache-software-foundation/ > > > > Looking forward to feedback and discussion, > > Thomas > > >
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
Bhupesh, Please see how that can be solved in a unified way using windows and watermarks. It is bounded data vs. unbounded data. In Beam for example, you can use the "global window" and the final watermark to accomplish what you are looking for. Batch is just a special case of streaming where the source emits the final watermark. Thanks, Thomas On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawdawrote: > Yes, if the user needs to develop a batch application, then batch aware > operators need to be used in the application. > The nature of the application is mostly controlled by the input and the > output operators used in the application. > > For example, consider an application which needs to filter records in a > input file and store the filtered records in another file. The nature of > this app is to end once the entire file is processed. Following things are > expected of the application: > >1. Once the input data is over, finalize the output file from .tmp >files. - Responsibility of output operator >2. End the application, once the data is read and processed - >Responsibility of input operator > > These functions are essential to allow the user to do higher level > operations like scheduling or running a workflow of batch applications. > > I am not sure about intermediate (processing) operators, as there is no > change in their functionality for batch use cases. Perhaps, allowing > running multiple batches in a single application may require similar > changes in processing operators as well. > > ~ Bhupesh > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale > wrote: > > > Will it make an impression on user that, if he has a batch usecase he has > > to use batch aware operators only? If so, is that what we expect? I am > not > > aware of how do we implement batch scenario so this might be a basic > > question. 
> > > > -Priyanka > > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda < > bhup...@datatorrent.com> > > wrote: > > > > > Hi All, > > > > > > While design / implementation for custom control tuples is ongoing, I > > > thought it would be a good idea to consider its usefulness in one of > the > > > use cases - batch applications. > > > > > > This is a proposal to adapt / extend existing operators in the Apache > > Apex > > > Malhar library so that it is easy to use them in batch use cases. > > > Naturally, this would be applicable for only a subset of operators like > > > File, JDBC and NoSQL databases. > > > For example, for a file based store, (say HDFS store), we could have > > > FileBatchInput and FileBatchOutput operators which allow easy > integration > > > into a batch application. These operators would be extended from their > > > existing implementations and would be "Batch Aware", in that they may > > > understand the meaning of some specific control tuples that flow > through > > > the DAG. Start batch and end batch seem to be the obvious candidates > that > > > come to mind. On receipt of such control tuples, they may try to modify > > the > > > behavior of the operator - to reinitialize some metrics or finalize an > > > output file for example. > > > > > > We can discuss the potential control tuples and actions in detail, but > > > first I would like to understand the views of the community for this > > > proposal. > > > > > > ~ Bhupesh > > > > > >
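The batch-aware behavior discussed in this thread — reinitializing per-batch metrics on a start-batch control tuple and finalizing the .tmp output file on an end-batch tuple — can be sketched as below. Everything here is hypothetical: at the time of the thread the custom control tuple API was still under design, so ControlTuple and the operator hooks are illustrative, not Malhar classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a "batch aware" file output operator reacting to
// start/end batch control tuples, per the proposal. Not a Malhar API.
public class BatchAwareOutputSketch {

  enum ControlTuple { START_BATCH, END_BATCH }

  static class FileOutputSketch {
    final StringBuilder tmpFile = new StringBuilder(); // stands in for the .tmp file
    final List<String> finalizedFiles = new ArrayList<>();
    long tuplesInBatch; // a per-batch metric

    void processControl(ControlTuple t) {
      if (t == ControlTuple.START_BATCH) {
        tuplesInBatch = 0;   // reinitialize per-batch metrics
        tmpFile.setLength(0);
      } else if (t == ControlTuple.END_BATCH) {
        // Finalize the output: rename .tmp to its final name in the real operator.
        finalizedFiles.add(tmpFile.toString());
      }
    }

    void processData(String tuple) {
      tmpFile.append(tuple).append('\n');
      tuplesInBatch++;
    }
  }

  public static void main(String[] args) {
    FileOutputSketch op = new FileOutputSketch();
    op.processControl(ControlTuple.START_BATCH);
    op.processData("a");
    op.processData("b");
    op.processControl(ControlTuple.END_BATCH);
    System.out.println("finalized files: " + op.finalizedFiles.size());
  }
}
```

In the windows-and-watermarks framing Thomas describes, END_BATCH corresponds to the final watermark of a bounded source rather than a batch-specific tuple.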
Re: Sharing jars among different Apex apps on cluster
Sounds like a fit for https://github.com/apache/apex-core/pull/410 ? On Mon, Jan 16, 2017 at 3:27 AM, Bhupesh Chawda wrote: > Hi All, > > We have a use case where I need to launch a number of DAGs on the cluster > one after the other in sequence programmatically. > > We are using the StramAppLauncher and StramAppFactory classes to launch a > DAG programmatically on the cluster and adding any third-party dependencies > as part of the configuration. > > It is working fine except for the following issue: > Every time a DAG is launched, it copies the dependencies to the application > folder and hence spends a good amount of time before the app actually > starts running. All of the apps I run belong to the same project and hence > don't actually need a separate set of jars. > > Is there any way I can make all the applications "share" the jars which are > uploaded when the first application is run? > > Thanks. > > ~ Bhupesh >
Re: APEXMALHAR-2382 User needs to create dt_meta table while using JdbcPOJOInsertOutputOperator
For JDBC exactly-once results, the window metadata needs to be committed along with the user data, hence the need for additional schema in the target database. There have been webinars and blogs about processing guarantees; please have a look at those. Table creation cannot be automatic by default. As I already said on the JIRA, it could be an option for testing or for development; it just needs to be off by default. I would like to see the table name changed. If the library can generate the DDL for the table for the target database, then possibly when the table is missing it can output that along with instructions into the log to make things easier for users. It might also be a good idea to document the setup steps for multiple operators to write to the same schema. Thanks On Sun, Jan 15, 2017 at 11:55 PM, Devendra Tagare wrote: > Hi, > > -1 on auto table creation. Reasons have been eloquently elaborated in the > earlier posts. > > +1 on revisiting the approach taken for exactly-once in the > JDBCPollInputOperator. > > One way could be to move the JDBC read and write operators into a separate > module (like apex-malhar/kafka) and add statistics, metrics, and meta-data > features along similar lines. This module can have the required concrete > implementations for mysql, psql, etc. > > Thanks, > Dev > > On Sun, Jan 15, 2017 at 10:34 PM, Chinmay Kolhatkar > wrote: > > > -1 for automatic schema creation... > > > > Moreover, I am wondering whether asking the user to create a dt_meta table is > > the right way. From an admin's perspective, an ask for creation of a meta table > > looks wrong to me. The dt_meta table is created for the purpose of exactly-once > > but it does not hold any user data. On this logic an admin might deny > > developers the creation of the table. > > > > I suggest starting a separate thread to do exactly-once for JDBC insert in > > a cleaner way. 
We can take a look at Kafka or File outputs to see how > > they've done it to achieve exactly-once without creating a meta location at the > > destination. > > > > -Chinmay. > > > > On Mon, Jan 16, 2017 at 11:16 AM, Pradeep Kumbhar < prad...@datatorrent.com > > > > wrote: > > > +1 on having the operator documentation explicitly mention that the "dt_meta" > > > table is mandatory > > > for the operator to work correctly. Also provide a sample table-creation > > > query for reference. > > > > > > On Sat, Jan 14, 2017 at 1:05 PM, AJAY GUPTA > > wrote: > > > > Since the query can be different for different databases, the user will > > > > have to provide the query to the operator. Rather than this, I believe it's > > > > easier for the user to directly execute the create-table query on the DB. > > > > > > > > Also, the create-table script won't be so heavy that we need a script for > > > > it. Probably adding a generic form of the query in the docs itself should > > > > suffice. > > > > > > > > Ajay > > > > > > > > On Sat, 14 Jan 2017 at 10:27 AM, Yogi Devendra < yogideven...@apache.org> > > > > wrote: > > > > > As Aniruddha pointed out, table creation should be done by the dbadmin. > > > > > In that case, a utility script will be helpful. > > > > > If we embed this code inside the operator or application, then it will be > > > > > difficult for the dbadmin to use it. > > > > > ~ Yogi > > > > > On 14 January 2017 at 03:43, Thomas Weise wrote: > > > > > > -1 for automatic schema modification, unless the user asked for it. See > > > > > > comment on JIRA. 
> > > > > > On Fri, Jan 13, 2017 at 5:11 AM, Aniruddha Thombare < > > > > > > anirud...@datatorrent.com> wrote: > > > > > > > The tables should be created / altered by the dbadmin. > > > > > > > We shouldn't worry about table creations as it's a one-time activity. > > > > > > > Thanks, > > > > > > > A > > > > > > > _ > > > > > > > Sent with difficulty, I mean handheld ;) > > > > > > > On 13 Jan 2017 6:37 pm, "Yogi Devendra" < yogideven...@apache.org > wrote: > > > > > > > I am not very keen on having a utility script. > > > > > > > But "no side-effects without explicit ask by the end-user" is important. > > > > > > > ~ Yogi > > > > > > > On 13 January 2017 at 16:44, Priyanka Gugale < pri...@apache.org> wrote: > > > > > > > > IMO it's
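Pradeep's request for a sample table-creation query, combined with Thomas's suggestion of logging the expected DDL when the table is missing, could look roughly like the sketch below. The table and column names (`dt_window_meta`, `dt_app_id`, `dt_operator_id`, `dt_window`) are illustrative assumptions for a window-metadata table, not the actual Malhar schema:

```java
// Hedged sketch: generate the DDL the operator expects for its window-meta
// table, so that when the table is missing it can be logged for the dbadmin
// to create out of band (rather than creating it automatically, which the
// thread agrees should be off by default).
public class MetaTableDdl {
  public static String generateDdl(String tableName) {
    return "CREATE TABLE " + tableName + " ("
        + "dt_app_id VARCHAR(100) NOT NULL, "
        + "dt_operator_id INT NOT NULL, "
        + "dt_window BIGINT NOT NULL, "
        + "PRIMARY KEY (dt_app_id, dt_operator_id))";
  }

  public static void main(String[] args) {
    // Logged with instructions instead of executed automatically.
    System.out.println("Meta table missing; please have your dbadmin run:");
    System.out.println(generateDdl("dt_window_meta"));
  }
}
```

Type names would need per-database adjustment, which is why the thread leans toward documenting a generic query rather than shipping a script.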
Re: Schema Discovery Support in Apex Applications
Those are not really anonymous POJOs... The definition of the POJO will be known to the user, since the upstream operator will convey the tuple type it will be emitting. Using that information the user can configure the operators. Those properties will be a bit different though. On Mon, Jan 16, 2017 at 4:20 PM, AJAY GUPTA wrote: > +1 for the idea. > > I just had one question. > > As I understand, there will be some form of anonymous POJO used as objects > to pass information from one operator to another. Can you share how the > user/operator developer would access the tuple object in case he wishes to > do something with it? > > > Ajay > > On Mon, Jan 16, 2017 at 2:53 PM, Chinmay Kolhatkar > wrote: > > > Hi All, > > > > Currently, if a DAG generated by a user contains any POJO-fied > > operators, the TUPLE_CLASS attribute needs to be set on each and every port > > which receives or sends a POJO. > > > > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup -> > > Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by the user on > > both the input and output ports of the transform and dedup operators and also on > > the parser output and formatter input. > > > > The proposal here is to reduce the work that is required by the user to configure > > the DAG. Technically speaking, if an operator knows its input schema and > > processing properties, it can determine the output schema and convey it to > > downstream operators. This way the complete pipeline can be configured > > without the user setting TUPLE_CLASS or even creating POJOs and adding them to > > the classpath. > > > > On the same idea, I want to propose an approach where the pipeline can be > > configured without the user setting TUPLE_CLASS or even creating POJOs and > > adding them to the classpath. 
> > Here is the document which at a high level explains the idea and a high > > level design: > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_ > > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing > > > > I would like to get opinion from community about feasibility and > > applications of this proposal. > > Once we get some consensus we can discuss the design in details. > > > > Thanks, > > Chinmay. > > >
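A toy model of the schema-propagation idea in the proposal (all names here are hypothetical, not the Apex API): each operator derives its output schema from its input schema plus its own processing properties, and conveys it downstream, so ports can be configured without the user setting TUPLE_CLASS anywhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of schema propagation. A schema is modeled as an ordered
// field-name -> type-name map; a transform-like operator derives its output
// schema from the input schema plus the fields its properties say it adds.
public class SchemaPropagationSketch {
  static Map<String, String> transformOutputSchema(
      Map<String, String> inputSchema, Map<String, String> derivedFields) {
    Map<String, String> out = new LinkedHashMap<>(inputSchema);
    out.putAll(derivedFields);  // transform adds/overrides its derived fields
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> parserOut = new LinkedHashMap<>();
    parserOut.put("name", "String");
    parserOut.put("age", "Integer");
    Map<String, String> derived = new LinkedHashMap<>();
    derived.put("isAdult", "Boolean");
    // The transform would convey this schema downstream; Dedup, Formatter,
    // etc. would configure themselves from it instead of from TUPLE_CLASS.
    System.out.println(transformOutputSchema(parserOut, derived));
  }
}
```

Chaining such derivations from the parser's declared schema down to the formatter is what would let the whole pipeline configure itself.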
Sharing jars among different Apex apps on cluster
Hi All, We have a use case where I need to launch a number of DAGs on the cluster one after the other in sequence programmatically. We are using the StramAppLauncher and StramAppFactory classes to launch a DAG programmatically on the cluster and adding any third-party dependencies as part of the configuration. It is working fine except for the following issue: Every time a DAG is launched, it copies the dependencies to the application folder and hence spends a good amount of time before the app actually starts running. All of the apps I run belong to the same project and hence don't actually need a separate set of jars. Is there any way I can make all the applications "share" the jars which are uploaded when the first application is run? Thanks. ~ Bhupesh
Re: Schema Discovery Support in Apex Applications
+1 for the idea. I just had one question. As I understand, there will be some form of anonymous POJO used as objects to pass information from one operator to another. Can you share how the user/operator developer would access the tuple object in case he wishes to do something with it? Ajay On Mon, Jan 16, 2017 at 2:53 PM, Chinmay Kolhatkar wrote: > Hi All, > > Currently, if a DAG generated by a user contains any POJO-fied > operators, the TUPLE_CLASS attribute needs to be set on each and every port > which receives or sends a POJO. > > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup -> > Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by the user on > both the input and output ports of the transform and dedup operators and also on > the parser output and formatter input. > > The proposal here is to reduce the work that is required by the user to configure > the DAG. Technically speaking, if an operator knows its input schema and > processing properties, it can determine the output schema and convey it to > downstream operators. This way the complete pipeline can be configured > without the user setting TUPLE_CLASS or even creating POJOs and adding them to > the classpath. > > On the same idea, I want to propose an approach where the pipeline can be > configured without the user setting TUPLE_CLASS or even creating POJOs and > adding them to the classpath. > Here is the document which at a high level explains the idea and a high-level > design: > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_ > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing > > I would like to get the opinion of the community about the feasibility and > applications of this proposal. > Once we get some consensus we can discuss the design in detail. > > Thanks, > Chinmay. >
[jira] [Resolved] (APEXMALHAR-2183) Add user document for CsvFormatter operator
[ https://issues.apache.org/jira/browse/APEXMALHAR-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shubham pathak resolved APEXMALHAR-2183. Resolution: Fixed > Add user document for CsvFormatter operator > --- > > Key: APEXMALHAR-2183 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2183 > Project: Apache Apex Malhar > Issue Type: Task >Reporter: Venkatesh Kottapalli >Assignee: Venkatesh Kottapalli > Labels: newbie > Fix For: 3.7.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > Add user documentation for CsvFormatter operator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (APEXMALHAR-2183) Add user document for CsvFormatter operator
[ https://issues.apache.org/jira/browse/APEXMALHAR-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shubham pathak updated APEXMALHAR-2183: --- Fix Version/s: 3.7.0 > Add user document for CsvFormatter operator > --- > > Key: APEXMALHAR-2183 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2183 > Project: Apache Apex Malhar > Issue Type: Task >Reporter: Venkatesh Kottapalli >Assignee: Venkatesh Kottapalli > Labels: newbie > Fix For: 3.7.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > Add user documentation for CsvFormatter operator
[jira] [Commented] (APEXMALHAR-2183) Add user document for CsvFormatter operator
[ https://issues.apache.org/jira/browse/APEXMALHAR-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823698#comment-15823698 ] ASF GitHub Bot commented on APEXMALHAR-2183: Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/400 > Add user document for CsvFormatter operator > --- > > Key: APEXMALHAR-2183 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2183 > Project: Apache Apex Malhar > Issue Type: Task >Reporter: Venkatesh Kottapalli >Assignee: Venkatesh Kottapalli > Labels: newbie > Original Estimate: 72h > Remaining Estimate: 72h > > Add user documentation for CsvFormatter operator
[GitHub] apex-malhar pull request #400: APEXMALHAR-2183 - Documentation for CsvFormat...
Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/400 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (APEXMALHAR-2384) Add documentation for FixedWIdthParser
[ https://issues.apache.org/jira/browse/APEXMALHAR-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823673#comment-15823673 ] ASF GitHub Bot commented on APEXMALHAR-2384: GitHub user Hitesh-Scorpio opened a pull request: https://github.com/apache/apex-malhar/pull/533 APEXMALHAR-2384 adding documentation for fixed width parser @amberarrow please review. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Hitesh-Scorpio/apex-malhar APEXMALHAR-2384_Documentation_Fixed_Width_Parser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/apex-malhar/pull/533.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #533 > Add documentation for FixedWIdthParser > -- > > Key: APEXMALHAR-2384 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2384 > Project: Apache Apex Malhar > Issue Type: Documentation >Reporter: Hitesh Kapoor >Assignee: Hitesh Kapoor >
Schema Discovery Support in Apex Applications
Hi All, Currently, if a DAG generated by a user contains any POJO-fied operators, the TUPLE_CLASS attribute needs to be set on each and every port which receives or sends a POJO. For e.g., if a DAG is like File -> Parser -> Transform -> Dedup -> Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by the user on both the input and output ports of the transform and dedup operators and also on the parser output and formatter input. The proposal here is to reduce the work that is required by the user to configure the DAG. Technically speaking, if an operator knows its input schema and processing properties, it can determine the output schema and convey it to downstream operators. This way the complete pipeline can be configured without the user setting TUPLE_CLASS or even creating POJOs and adding them to the classpath. On the same idea, I want to propose an approach where the pipeline can be configured without the user setting TUPLE_CLASS or even creating POJOs and adding them to the classpath. Here is the document which at a high level explains the idea and a high-level design: https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing I would like to get the opinion of the community about the feasibility and applications of this proposal. Once we get some consensus we can discuss the design in detail. Thanks, Chinmay.
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
Yes, if the user needs to develop a batch application, then batch-aware operators need to be used in the application. The nature of the application is mostly controlled by the input and the output operators used in the application. For example, consider an application which needs to filter records in an input file and store the filtered records in another file. The nature of this app is to end once the entire file is processed. The following things are expected of the application: 1. Once the input data is over, finalize the output file from .tmp files. - Responsibility of the output operator 2. End the application once the data is read and processed - Responsibility of the input operator These functions are essential to allow the user to do higher-level operations like scheduling or running a workflow of batch applications. I am not sure about intermediate (processing) operators, as there is no change in their functionality for batch use cases. Perhaps allowing running multiple batches in a single application may require similar changes in processing operators as well. ~ Bhupesh On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale wrote: > Will it give the user the impression that, if he has a batch use case, he has > to use batch-aware operators only? If so, is that what we expect? I am not > aware of how we implement the batch scenario, so this might be a basic > question. > > -Priyanka > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda > wrote: > > > Hi All, > > > > While the design / implementation for custom control tuples is ongoing, I > > thought it would be a good idea to consider its usefulness in one of the > > use cases - batch applications. > > > > This is a proposal to adapt / extend existing operators in the Apache Apex > > Malhar library so that it is easy to use them in batch use cases. > > Naturally, this would be applicable for only a subset of operators like > > File, JDBC and NoSQL databases. 
> > For example, for a file based store, (say HDFS store), we could have > > FileBatchInput and FileBatchOutput operators which allow easy integration > > into a batch application. These operators would be extended from their > > existing implementations and would be "Batch Aware", in that they may > > understand the meaning of some specific control tuples that flow through > > the DAG. Start batch and end batch seem to be the obvious candidates that > > come to mind. On receipt of such control tuples, they may try to modify > the > > behavior of the operator - to reinitialize some metrics or finalize an > > output file for example. > > > > We can discuss the potential control tuples and actions in detail, but > > first I would like to understand the views of the community for this > > proposal. > > > > ~ Bhupesh > > >
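The "Batch Aware" behavior proposed above - reinitialize state on a start-batch control tuple, finalize the .tmp output on an end-batch control tuple - might be sketched as follows. The interfaces are hypothetical stand-ins, not Malhar code, since the custom control tuple design is still under discussion:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a batch-aware output operator. The in-memory lists stand
// in for a .tmp file and its finalized versions; real operators would do
// file renames instead.
public class BatchAwareOutputSketch {
  enum ControlTuple { START_BATCH, END_BATCH }  // hypothetical control tuples

  final List<String> tmpFile = new ArrayList<>();
  final List<List<String>> finalizedBatches = new ArrayList<>();

  void onControl(ControlTuple c) {
    switch (c) {
      case START_BATCH:
        tmpFile.clear();  // reinitialize per-batch state/metrics
        break;
      case END_BATCH:
        finalizedBatches.add(new ArrayList<>(tmpFile));  // finalize .tmp output
        tmpFile.clear();
        break;
    }
  }

  void onTuple(String t) {
    tmpFile.add(t);  // ordinary data path, unchanged for batch use
  }

  public static void main(String[] args) {
    BatchAwareOutputSketch op = new BatchAwareOutputSketch();
    op.onControl(ControlTuple.START_BATCH);
    op.onTuple("rec1");
    op.onTuple("rec2");
    op.onControl(ControlTuple.END_BATCH);
    System.out.println(op.finalizedBatches.size());
  }
}
```

Because the data path (`onTuple`) is untouched, this also illustrates Bhupesh's point that intermediate processing operators need no change unless multiple batches run in one application.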
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
Will it give the user the impression that, if he has a batch use case, he has to use batch-aware operators only? If so, is that what we expect? I am not aware of how we implement the batch scenario, so this might be a basic question. -Priyanka On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda wrote: > Hi All, > > While the design / implementation for custom control tuples is ongoing, I > thought it would be a good idea to consider its usefulness in one of the > use cases - batch applications. > > This is a proposal to adapt / extend existing operators in the Apache Apex > Malhar library so that it is easy to use them in batch use cases. > Naturally, this would be applicable for only a subset of operators like > File, JDBC and NoSQL databases. > For example, for a file-based store (say an HDFS store), we could have > FileBatchInput and FileBatchOutput operators which allow easy integration > into a batch application. These operators would be extended from their > existing implementations and would be "Batch Aware", in that they may > understand the meaning of some specific control tuples that flow through > the DAG. Start batch and end batch seem to be the obvious candidates that > come to mind. On receipt of such control tuples, they may try to modify the > behavior of the operator - to reinitialize some metrics or finalize an > output file, for example. > > We can discuss the potential control tuples and actions in detail, but > first I would like to understand the views of the community on this > proposal. > > ~ Bhupesh >