Re: Proposal: CompositeAccumulation for Windowed Operator
Thanks Bright. I've reviewed your PR. It looks good. Just one minor change is required; please see my comment there. On Mon, Feb 27, 2017 at 11:26 PM, Bright Chen wrote: > A JIRA has been created: https://issues.apache.org/jira/browse/APEXMALHAR-2428 > > On Mon, Feb 27, 2017 at 9:53 AM, Bright Chen wrote: > > I think Chinmay's proposal could make applications clearer and improve > > performance, since locating the key/window costs most of the time. > > A suggested usage for CompositeAccumulation could be as follows: > > *//following is sample code showing how to add sub-accumulations* > > *CompositeAccumulation accumulations = new > > CompositeAccumulation<>();* > > *AccumulationTag sumTag = > > accumulations.addAccumulation((Accumulation)new SumAccumulation());* > > *AccumulationTag countTag = > > accumulations.addAccumulation((Accumulation)new Count());* > > *AccumulationTag maxTag = accumulations.addAccumulation(new Max());* > > *AccumulationTag minTag = accumulations.addAccumulation(new Min());* > > *//following is a sample showing how to get the sub-accumulation output* > > *accumulations.getSubOutput(sumTag, outputValues)* > > *accumulations.getSubOutput(countTag, outputValues)* > > *accumulations.getSubOutput(maxTag, outputValues)* > > *accumulations.getSubOutput(minTag, outputValues)* > > Thanks > > Bright > > On Sun, Feb 26, 2017 at 10:33 PM, Chinmay Kolhatkar < > > chin...@datatorrent.com> wrote: > >> Dear Community, > >> Currently we have accumulations for individual types of accumulations, > >> but if one wants to do more than one accumulation in a single stage of > >> the Windowed Operator, it is not possible. > >> I want to propose an idea called "CompositeAccumulation", where more than one > >> accumulation can be configured, and this accumulation can rely on multiple > >> accumulations to generate the final result/output. > >> The output can take either of two forms: > >> 1. Just the list of outputs, with AccumulationTags as identifiers. > >> 2. The results of multiple accumulations merged using some user-defined logic. > >> For example, in an aggregation case, the input POJO to this accumulation can be a > >> POJO containing NumberOfOrders as a field, and in the output one might need to > >> generate a final (single) POJO which contains the results of multiple > >> accumulations, like SUM and COUNT on NumberOfOrders, as different fields of the > >> outgoing POJO. > >> I particularly see the use of this for the multiple aggregations we would > >> like to do in the SQL on Apex integration. > >> Please share your thoughts on the same. > >> Thanks, > >> Chinmay.
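For readers following the API sketched in the thread, here is a minimal, self-contained sketch of how such a CompositeAccumulation could be implemented. It assumes a simplified Accumulation interface (the real Malhar interface has additional methods such as merge() and getRetraction()), and all names beyond those in the sample usage above are illustrative, not the actual PR implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Malhar's Accumulation interface (illustrative only).
interface Accumulation<InputT, AccumT, OutputT> {
  AccumT defaultAccumulatedValue();
  AccumT accumulate(AccumT accumulatedValue, InputT input);
  OutputT getOutput(AccumT accumulatedValue);
}

// Opaque handle identifying one sub-accumulation inside the composite.
final class AccumulationTag {
  final int index;
  AccumulationTag(int index) { this.index = index; }
}

// Applies every registered sub-accumulation to each input, so the costly
// key/window lookup in the windowed operator happens only once per tuple.
class CompositeAccumulation<InputT>
    implements Accumulation<InputT, List<Object>, List<Object>> {
  private final List<Accumulation<InputT, Object, Object>> subs = new ArrayList<>();

  @SuppressWarnings("unchecked")
  public AccumulationTag addAccumulation(Accumulation<InputT, ?, ?> sub) {
    subs.add((Accumulation<InputT, Object, Object>) sub);
    return new AccumulationTag(subs.size() - 1);
  }

  @Override
  public List<Object> defaultAccumulatedValue() {
    List<Object> values = new ArrayList<>();
    for (Accumulation<InputT, Object, Object> sub : subs) {
      values.add(sub.defaultAccumulatedValue());
    }
    return values;
  }

  @Override
  public List<Object> accumulate(List<Object> values, InputT input) {
    for (int i = 0; i < subs.size(); i++) {
      values.set(i, subs.get(i).accumulate(values.get(i), input));
    }
    return values;
  }

  @Override
  public List<Object> getOutput(List<Object> values) {
    List<Object> outputs = new ArrayList<>();
    for (int i = 0; i < subs.size(); i++) {
      outputs.add(subs.get(i).getOutput(values.get(i)));
    }
    return outputs;
  }

  // Extracts one sub-accumulation's result from the composite output.
  public Object getSubOutput(AccumulationTag tag, List<Object> outputValues) {
    return outputValues.get(tag.index);
  }
}

public class CompositeAccumulationDemo {
  public static void main(String[] args) {
    CompositeAccumulation<Long> acc = new CompositeAccumulation<>();
    AccumulationTag sumTag = acc.addAccumulation(new Accumulation<Long, Long, Long>() {
      @Override public Long defaultAccumulatedValue() { return 0L; }
      @Override public Long accumulate(Long a, Long in) { return a + in; }
      @Override public Long getOutput(Long a) { return a; }
    });
    AccumulationTag countTag = acc.addAccumulation(new Accumulation<Long, Long, Long>() {
      @Override public Long defaultAccumulatedValue() { return 0L; }
      @Override public Long accumulate(Long a, Long in) { return a + 1; }
      @Override public Long getOutput(Long a) { return a; }
    });
    List<Object> state = acc.defaultAccumulatedValue();
    for (long v : new long[] {3, 5, 7}) {
      state = acc.accumulate(state, v);
    }
    List<Object> out = acc.getOutput(state);
    System.out.println("sum=" + acc.getSubOutput(sumTag, out));     // sum=15
    System.out.println("count=" + acc.getSubOutput(countTag, out)); // count=3
  }
}
```

The AccumulationTag is just an index into the composite's internal list, which is what makes getSubOutput a constant-time lookup on the combined output.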
Re: Maven build package error with 3.5
The root cause seems to be embedded in your output: Could not transfer artifact org.apache.apex:malhar-library:pom:3.5.0 from/to central ( https://repo.maven.apache.org/maven2): sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target -> [Help 1] Could this be because a corporate firewall is blocking this access? You can check this out: http://stackoverflow.com/questions/25911623/problems-using-maven-and-ssl-behind-proxy On Mon, Feb 27, 2017 at 3:15 PM, Dongming Liang wrote: > It was running well with 3.4, but is now failing with Apex 3.5 > > ➜ log-aggregator git:(apex-tcp) ✗ mvn package -DskipTests > [INFO] Scanning for projects... > [INFO] > [INFO] > > [INFO] Building Aggregator 1.0-SNAPSHOT > [INFO] > > Downloading: > https://repo.maven.apache.org/maven2/org/apache/apex/malhar-library/3.5.0/malhar-library-3.5.0.pom > Downloading: > https://repo.maven.apache.org/maven2/org/apache/apex/apex-api/3.5.0/apex-api-3.5.0.pom > Downloading: > https://repo.maven.apache.org/maven2/org/apache/apex/apex-common/3.5.0/apex-common-3.5.0.pom > Downloading: > https://repo.maven.apache.org/maven2/org/apache/apex/apex-engine/3.5.0/apex-engine-3.5.0.pom > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 10.435 s > [INFO] Finished at: 2017-02-27T15:03:06-08:00 > [INFO] Final Memory: 11M/245M > [INFO] > > [ERROR] Failed to execute goal on project log-aggregator: Could not resolve > dependencies for project > com.capitalone.vault8:log-aggregator:jar:1.0-SNAPSHOT: Failed to collect > dependencies at org.apache.apex:malhar-library:jar:3.5.0: Failed to read > artifact descriptor for org.apache.apex:malhar-library:jar:3.5.0: Could not > transfer artifact org.apache.apex:malhar-library:pom:3.5.0 from/to > central ( > https://repo.maven.apache.org/maven2): > sun.security.validator.ValidatorException: PKIX path building failed: >
sun.security.provider.certpath.SunCertPathBuilderException: unable to find > valid certification path to requested target -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, > please read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/ > DependencyResolutionException > ➜ log-aggregator git:(apex-tcp) ✗ > > The pom file is: > > > http://maven.apache.org/POM/4.0.0; > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 > http://maven.apache.org/xsd/maven-4.0.0.xsd;> > 4.0.0 > > com.mycom.teamx > log-aggregator > jar > 1.0-SNAPSHOT > > > Aggregator > Log Aggregator > > > > 3.5.0 > lib/*.jar > > > > > > org.apache.maven.plugins > maven-eclipse-plugin > 2.9 > >true > > > > maven-compiler-plugin > 3.3 > >UTF-8 >1.7 >1.7 >true >false >true >true > > > > maven-dependency-plugin > 2.8 > > > copy-dependencies > prepare-package > >copy-dependencies > > >target/deps >runtime > > > > > > > maven-assembly-plugin > > > app-package-assembly > package > >single > > >${project.artifactId}-${project.version} > -apexapp >false > > src/assemble/appPackage.xml > > > 0755 > > > >${apex.apppackage.classpath} >${apex.version} > > ${project.groupId} > > ${project.artifactId} > > ${project.version} > > ${project.name} > > ${project.description} Package-Description> > > > > > > > > > maven-antrun-plugin > 1.7 > > > package > > >
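When Maven reports a PKIX "unable to find valid certification path" error behind a corporate proxy, a quick sanity check is whether the JVM that runs Maven can load its default truststore at all. This is not from the thread, just a small diagnostic sketch; the keytool command in the trailing comment is illustrative (alias and file paths are assumptions):

```java
import java.security.KeyStore;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class TruststoreCheck {
  public static void main(String[] args) throws Exception {
    // Passing a null KeyStore initializes the factory with the JVM's default
    // truststore (what Maven uses unless -Djavax.net.ssl.trustStore overrides it).
    TrustManagerFactory tmf =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    tmf.init((KeyStore) null);
    int cas = 0;
    for (TrustManager tm : tmf.getTrustManagers()) {
      if (tm instanceof X509TrustManager) {
        cas += ((X509TrustManager) tm).getAcceptedIssuers().length;
      }
    }
    System.out.println("Trusted CA certificates: " + cas);
    // If a corporate proxy re-signs TLS traffic, its CA certificate must be
    // added to this truststore, e.g. (alias/file names are illustrative):
    //   keytool -importcert -alias corp-proxy -file proxy-ca.crt \
    //     -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit
  }
}
```

A count of zero (or an exception) points at a broken or overridden truststore rather than at Maven itself.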
Maven build package error with 3.5
It was running well with 3.4, but now failing with Apex 3.5 ➜ log-aggregator git:(apex-tcp) ✗ mvn package -DskipTests [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Aggregator 1.0-SNAPSHOT [INFO] Downloading: https://repo.maven.apache.org/maven2/org/apache/apex/malhar-library/3.5.0/malhar-library-3.5.0.pom Downloading: https://repo.maven.apache.org/maven2/org/apache/apex/apex-api/3.5.0/apex-api-3.5.0.pom Downloading: https://repo.maven.apache.org/maven2/org/apache/apex/apex-common/3.5.0/apex-common-3.5.0.pom Downloading: https://repo.maven.apache.org/maven2/org/apache/apex/apex-engine/3.5.0/apex-engine-3.5.0.pom [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 10.435 s [INFO] Finished at: 2017-02-27T15:03:06-08:00 [INFO] Final Memory: 11M/245M [INFO] [ERROR] Failed to execute goal on project log-aggregator: Could not resolve dependencies for project com.capitalone.vault8:log-aggregator:jar:1.0-SNAPSHOT: Failed to collect dependencies at org.apache.apex:malhar-library:jar:3.5.0: Failed to read artifact descriptor for org.apache.apex:malhar-library:jar:3.5.0: Could not transfer artifact org.apache.apex:malhar-library:pom:3.5.0 from/to central ( https://repo.maven.apache.org/maven2): sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException ➜ log-aggregator git:(apex-tcp) ✗ The pom file is: http://maven.apache.org/POM/4.0.0; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd;> 4.0.0 com.mycom.teamx log-aggregator jar 1.0-SNAPSHOT Aggregator Log Aggregator 3.5.0 lib/*.jar org.apache.maven.plugins maven-eclipse-plugin 2.9 true maven-compiler-plugin 3.3 UTF-8 1.7 1.7 true false true true maven-dependency-plugin 2.8 copy-dependencies prepare-package copy-dependencies target/deps runtime maven-assembly-plugin app-package-assembly package single ${project.artifactId}-${project.version}-apexapp false src/assemble/appPackage.xml 0755 ${apex.apppackage.classpath} ${apex.version} ${project.groupId} ${project.artifactId} ${project.version} ${project.name} ${project.description} maven-antrun-plugin 1.7 package run org.apache.apex malhar-library ${apex.version} * * org.apache.apex apex-api ${apex.version} org.apache.apex apex-common ${apex.version} * * junit junit 4.10 test org.apache.apex apex-engine ${apex.version} test * * joda-time joda-time 2.9.1 javax.validation validation-api 1.1.0.Final org.apache.hadoop hadoop-common 2.3.0 Thanks, - Dongming
[jira] [Commented] (APEXMALHAR-2366) Apply BloomFilter to Bucket
[ https://issues.apache.org/jira/browse/APEXMALHAR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886659#comment-15886659 ] bright chen commented on APEXMALHAR-2366: - Hi [~bhupesh] The only difference, as far as I can tell, is that this BloomFilter implementation uses SerializationBuffer to save some copying and garbage collection; I am not sure how much impact that has on performance. Another thing is that Chaitanya's BloomFilter is in Megh; it would at least need to be moved to the Malhar library before we can use it, and I am not sure whether there are any license issues either. > Apply BloomFilter to Bucket > --- > > Key: APEXMALHAR-2366 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2366 > Project: Apache Apex Malhar > Issue Type: Improvement >Reporter: bright chen >Assignee: bright chen > Original Estimate: 192h > Remaining Estimate: 192h > > The bucket get() will check the cache and then check the stored files if > the entry is not in the cache. Checking the files is a pretty heavy > operation due to file seeks. > The chance of a file check is very high if the key range is large. > I suggest applying a BloomFilter to each bucket to reduce the chance of reading from file. > If the buckets are managed by ManagedStateImpl, the number of entries per bucket would be > very large and the BloomFilter might not be useful after a while. But if the > buckets are managed by ManagedTimeUnifiedStateImpl, each bucket keeps a certain > amount of entries and a BloomFilter would be very useful. > For implementation: > Guava already has a BloomFilter, and its interface is pretty simple and > fits our case. But Guava 11 is not compatible with Guava 14 (Guava 11 uses > Sink while Guava 14 uses PrimitiveSink). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
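For readers unfamiliar with why a Bloom filter helps this bucket lookup: a negative membership answer is definitive, so the bucket can skip the expensive file seek entirely, while a positive answer only means "possibly present". Guava's com.google.common.hash.BloomFilter, discussed above, is the production option; the following is a minimal self-contained sketch with an illustrative (not production-grade) hash:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch. mightContain() == false means the key is
// definitely absent, so a ManagedState bucket could skip the file seek.
class SimpleBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  SimpleBloomFilter(int numBits, int numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new BitSet(numBits);
  }

  // FNV-style mixing with a per-function seed; illustrative only.
  private int hash(byte[] key, int seed) {
    int h = seed * 0x9E3779B9;
    for (byte b : key) {
      h = (h ^ b) * 0x01000193;
    }
    return Math.floorMod(h, numBits);
  }

  void put(byte[] key) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(hash(key, i + 1));
    }
  }

  // false => definitely not present (no file read needed);
  // true  => possibly present (must still consult cache/files).
  boolean mightContain(byte[] key) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(hash(key, i + 1))) {
        return false;
      }
    }
    return true;
  }
}
```

The false-positive rate is tuned by sizing numBits and numHashes against the expected number of entries, which is why the per-bucket bound of ManagedTimeUnifiedStateImpl (noted in the JIRA) matters: an unbounded bucket saturates the bit set over time.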
Re: Java packages: legacy -> org.apache.apex
Let's not confuse open source with the ASF; they are not the same. Not all open source projects are part of Apache. The majority of Apache projects do use "org.apache." package names. Thank you, Vlad On 2/27/17 10:24, Sanjay Pujare wrote: +1 for bullet 1, assuming new code implies brand new classes (since it doesn't involve any backward compatibility issues). We can always review contributor PRs to make sure new code is added under the new package naming guidelines. But for 2 and 3 I have a question/comment: is there even a need to do it? There is a lot of open source code with package names like com.google.* and com.sun.*, and as far as I know there are no moves afoot to rename those packages. The renaming itself doesn't add any new functionality or technical capabilities, but it introduces instability in Apex code as well as user code. Just a thought... On Mon, Feb 27, 2017 at 8:23 AM, Chinmay Kolhatkar wrote: Thomas, I agree with you that we need this migration to be done, but I have a different opinion on how to execute it. I think if we do this in phases as described above, users might end up in more confusion. For this migration, I think we should follow these steps: 1. Whether for the operator library or core components, we should announce widely on the dev and users mailing lists that "...such a change is going to happen in the next release". 2. Take up the work all at once and not phase it. Thanks, Chinmay. On Mon, Feb 27, 2017 at 9:09 PM, Thomas Weise wrote: Hi, This topic has come up on several PRs and I think it warrants a broader discussion. At the time of incubation, the decision was to defer the change of Java packages from com.datatorrent to org.apache.apex until the next major release, to ensure backward compatibility for users. Unfortunately that has led to some confusion, as contributors continue to add new code under legacy packages. It is also a wider issue that examples for using Apex continue to refer to com.datatorrent packages, nearly one year after graduation. More and more user code is being built on top of something that needs to change; the can is being kicked down the road and users will face more changes later. I would like to propose the following: 1. All new code has to be submitted under org.apache.apex packages. 2. Not all code is under backward compatibility restrictions, and in those cases we can rename the packages right away. Examples: buffer server, engine, demos/examples, benchmarks. 3. Discuss when the core API and operators can be changed. For operators we have a bit more freedom to make changes before a major release, as most of them are marked @Evolving and users have the ability to continue using a prior version of Malhar with a newer engine due to the engine backward compatibility guarantee. Thanks, Thomas
Re: Java packages: legacy -> org.apache.apex
For Malhar, for existing operators, I prefer we do this as part of the planned refactoring that breaks the monolithic modules into baby packages, and I would also prefer deprecating the existing operators in place. This will help us achieve two things. First, the user will see all the new changes at once, as opposed to dealing with them twice (with the package rename and the dependency changes), and second, it will allow for a smoother transition as the existing code will still work in a deprecated state. It will also give a more consistent structure to Malhar. For new operators, we can go with the new package path, but we need to ensure they get moved into the baby packages as well. For demos, we can modify the paths, as the apps are typically used wholesale and the interface is typically manual interaction. For core, if we are adding new API subsystems, like the launcher API we added recently for example, we can go with the new package path, but if we are making incremental additions to existing functionality, I feel it is better to keep them in the same package. I also prefer we keep the package of the implementation classes consistent with the API, for understandability and readability of the code. So, for example, we don't change the package path of LogicalPlan, as it is an implementation of DAG. It is subjective, but it would be good if we can also do the same with classes closely related to the implementation classes. Maybe we can move these on a package-by-package basis; for example, everything in com.datatorrent.stram.engine could be moved. Completely internal components like the buffer server we can move wholesale. We can consider moving all APIs and classes when we go to the next major release, but I would like to see if we can find a way to support the existing API for one more major release in deprecated mode. Thanks On Mon, Feb 27, 2017 at 7:39 AM, Thomas Weise wrote: > Hi, > > This topic has come up on several PRs and I think it warrants a broader > discussion. > > At the time of incubation, the decision was to defer the change of Java > packages from com.datatorrent to org.apache.apex until the next major release, to > ensure backward compatibility for users. > > Unfortunately that has led to some confusion, as contributors continue to > add new code under legacy packages. > > It is also a wider issue that examples for using Apex continue to refer to > com.datatorrent packages, nearly one year after graduation. More and more > user code is being built on top of something that needs to change; the can > is being kicked down the road and users will face more changes later. > > I would like to propose the following: > > 1. All new code has to be submitted under org.apache.apex packages > > 2. Not all code is under backward compatibility restriction and in those > cases we can rename the packages right away. Examples: buffer server, > engine, demos/examples, benchmarks > > 3. Discuss when the core API and operators can be changed. For operators we > have a bit more freedom to do changes before a major release as most of > them are marked @Evolving and users have the ability to continue using > prior version of Malhar with newer engine due to engine backward > compatibility guarantee. > > Thanks, > Thomas >
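The "deprecate in place" idea discussed in this thread can be illustrated with a tiny sketch: the implementation moves to the new namespace, and the old name survives as an empty deprecated subclass so existing user code keeps compiling while emitting deprecation warnings. All class names here are hypothetical; in the real migration the two classes would live under org.apache.apex.* and com.datatorrent.* respectively.

```java
// Stand-in for the class at its new home (would live under org.apache.apex.*).
class NewPackageOperator {
  public String process(String tuple) {
    return tuple.toUpperCase();  // trivial placeholder logic
  }
}

/**
 * Stand-in for the legacy class (would stay under com.datatorrent.*).
 * Empty deprecated subclass: old imports keep compiling, and the compiler
 * nudges users toward the new package via deprecation warnings.
 *
 * @deprecated use {@link NewPackageOperator} instead.
 */
@Deprecated
class LegacyPackageOperator extends NewPackageOperator {
}

public class DeprecateInPlaceDemo {
  public static void main(String[] args) {
    // Old user code continues to work unchanged during the transition.
    LegacyPackageOperator op = new LegacyPackageOperator();
    System.out.println(op.process("apex"));  // APEX
  }
}
```

This pattern only works cleanly for non-final classes with accessible constructors, which is one reason the thread distinguishes operators (feasible) from internal components like the buffer server (can simply be moved).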
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
I now see your rationale for putting the filename in the window. As far as I understand, the reasons the filename is not part of the key and the Global Window is not used are: 1) The files are processed in sequence, not in parallel. 2) The windowed operator should not keep the state associated with a file once processing of that file is done. 3) The trigger should be fired for the file when the file is done processing. However, if the file is just a sequence number that has nothing to do with a timestamp, assigning a timestamp to a file is not an intuitive thing to do and would just create confusion for users, especially when it's used as an example for new users. How about having a separate class called SequenceWindow? And perhaps TimeWindow can inherit from it? David On Mon, Feb 27, 2017 at 8:58 AM, Thomas Weise wrote: > On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda > wrote: > > > I think my comments related to count based windows might be causing > > confusion. Let's not discuss count based scenarios for now. > > > > Just want to make sure we are on the same page wrt. the "each file is a > > batch" use case. As mentioned by Thomas, each tuple from the same file > > has the same timestamp (which is just a sequence number) and that helps > > keep tuples from each file in a separate window. > > > > Yes, in this case it is a sequence number, but it could be a time stamp > also, depending on the file naming convention. And if it was event time > processing, the watermark would be derived from records within the file. > > Agreed, the source should have a mechanism to control the time stamp > extraction along with everything else pertaining to the watermark > generation. > > > > We could also implement a "timestampExtractor" interface to identify the > > timestamp (sequence number) for a file. 
> > > > ~ Bhupesh > > > > > > ___ > > > > Bhupesh Chawda > > > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > > > www.datatorrent.com | apex.apache.org > > > > > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise wrote: > > > > > I don't think this is a use case for count based window. > > > > > > We have multiple files that are retrieved in a sequence and there is no > > > knowledge of the number of records per file. The requirement is to > > > aggregate each file separately and emit the aggregate when the file is > > read > > > fully. There is no concept of "end of something" for an individual key > > and > > > global window isn't applicable. > > > > > > However, as already explained and implemented by Bhupesh, this can be > > > solved using watermark and window (in this case the window timestamp > > isn't > > > a timestamp, but a file sequence, but that doesn't matter. > > > > > > Thomas > > > > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan wrote: > > > > > > > I don't think this is the way to go. Global Window only means the > > > timestamp > > > > does not matter (or that there is no timestamp). It does not > > necessarily > > > > mean it's a large batch. Unless there is some notion of event time > for > > > each > > > > file, you don't want to embed the file into the window itself. > > > > > > > > If you want the result broken up by file name, and if the files are > to > > be > > > > processed in parallel, I think making the file name be part of the > key > > is > > > > the way to go. I think it's very confusing if we somehow make the > file > > to > > > > be part of the window. > > > > > > > > For count-based window, it's not implemented yet and you're welcome > to > > > add > > > > that feature. In case of count-based windows, there would be no > notion > > of > > > > time and you probably only trigger at the end of each window. 
In the > > case > > > > of count-based windows, the watermark only matters for batch since > you > > > need > > > > a way to know when the batch has ended (if the count is 10, the > number > > of > > > > tuples in the batch is let's say 105, you need a way to end the last > > > window > > > > with 5 tuples). > > > > > > > > David > > > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda < > > bhup...@datatorrent.com > > > > > > > > wrote: > > > > > > > > > Hi David, > > > > > > > > > > Thanks for your comments. > > > > > > > > > > The wordcount example that I created based on the windowed operator > > > does > > > > > processing of word counts per file (each file as a separate batch), > > > i.e. > > > > > process counts for each file and dump into separate files. > > > > > As I understand Global window is for one large batch; i.e. all > > incoming > > > > > data falls into the same batch. This could not be processed using > > > > > GlobalWindow option as we need more than one windows. In this > case, I > > > > > configured the windowed operator to have time windows of 1ms each > and > > > > > passed data for each file
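The "timestampExtractor" idea mentioned in this exchange could look roughly like the sketch below. The interface name and the file-naming convention are assumptions for illustration: every tuple read from a file would be assigned the file's extracted value as its window "timestamp", whether that value is a sequence number or a real epoch derived from a time-based naming convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed "timestampExtractor": derive one long
// value per file so that every tuple read from that file lands in the same
// window of the windowed operator.
interface FileTimestampExtractor {
  long extractTimestamp(String fileName);
}

class SequenceNumberExtractor implements FileTimestampExtractor {
  private static final Pattern SEQ = Pattern.compile("(\\d+)");

  @Override
  public long extractTimestamp(String fileName) {
    Matcher m = SEQ.matcher(fileName);
    if (m.find()) {
      return Long.parseLong(m.group(1));  // first digit run, e.g. "input-0007.txt" -> 7
    }
    throw new IllegalArgumentException("no sequence number in: " + fileName);
  }
}

public class TimestampExtractorDemo {
  public static void main(String[] args) {
    FileTimestampExtractor extractor = new SequenceNumberExtractor();
    System.out.println(extractor.extractTimestamp("input-0007.txt"));  // 7
  }
}
```

Swapping in a different implementation (say, one parsing a date out of the file name) is what would let the same source support true event-time semantics, as Thomas notes above.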
[GitHub] apex-malhar pull request #566: APEXMALHAR-2428 CompositeAccumulation for win...
GitHub user brightchen opened a pull request: https://github.com/apache/apex-malhar/pull/566 APEXMALHAR-2428 CompositeAccumulation for windowed operator You can merge this pull request into a Git repository by running: $ git pull https://github.com/brightchen/apex-malhar APEXMALHAR-2428 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/apex-malhar/pull/566.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #566 commit e2b1d31e5e69c0515db68baccf1fb97f8f7077c0 Author: brightchen Date: 2017-01-27T20:37:07Z APEXMALHAR-2428 CompositeAccumulation for windowed operator --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (APEXMALHAR-2428) CompositeAccumulation for windowed operator
[ https://issues.apache.org/jira/browse/APEXMALHAR-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886304#comment-15886304 ] ASF GitHub Bot commented on APEXMALHAR-2428: GitHub user brightchen opened a pull request: https://github.com/apache/apex-malhar/pull/566 APEXMALHAR-2428 CompositeAccumulation for windowed operator You can merge this pull request into a Git repository by running: $ git pull https://github.com/brightchen/apex-malhar APEXMALHAR-2428 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/apex-malhar/pull/566.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #566 commit e2b1d31e5e69c0515db68baccf1fb97f8f7077c0 Author: brightchen Date: 2017-01-27T20:37:07Z APEXMALHAR-2428 CompositeAccumulation for windowed operator > CompositeAccumulation for windowed operator > --- > > Key: APEXMALHAR-2428 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2428 > Project: Apache Apex Malhar > Issue Type: New Feature >Reporter: bright chen >Assignee: bright chen >
[jira] [Resolved] (APEXMALHAR-2395) create MultiAccumulation
[ https://issues.apache.org/jira/browse/APEXMALHAR-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bright chen resolved APEXMALHAR-2395. - Resolution: Duplicate Duplicate of https://issues.apache.org/jira/browse/APEXMALHAR-2428 > create MultiAccumulation > > > Key: APEXMALHAR-2395 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2395 > Project: Apache Apex Malhar > Issue Type: Task >Reporter: bright chen >Assignee: bright chen > > Create MultiAccumulation, which supports SUM, MAX, MIN, and COUNT. AVG can be > computed from SUM and COUNT at query time.
Re: Java packages: legacy -> org.apache.apex
+1 for bullet 1, assuming new code implies brand new classes (since it doesn't involve any backward compatibility issues). We can always review contributor PRs to make sure new code is added under the new package naming guidelines. But for 2 and 3 I have a question/comment: is there even a need to do it? There is a lot of open source code with package names like com.google.* and com.sun.*, and as far as I know there are no moves afoot to rename those packages. The renaming itself doesn't add any new functionality or technical capabilities, but it introduces instability in Apex code as well as user code. Just a thought... On Mon, Feb 27, 2017 at 8:23 AM, Chinmay Kolhatkar wrote: > Thomas, > > I agree with you that we need this migration to be done, but I have a > different opinion on how to execute it. > I think if we do this in phases as described above, users might end up in > more confusion. > > For this migration, I think we should follow these steps: > 1. Whether for the operator library or core components, we should announce > widely on the dev and users mailing lists that "...such a change is going to > happen in the next release". > 2. Take up the work all at once and not phase it. > > Thanks, > Chinmay. > > > > On Mon, Feb 27, 2017 at 9:09 PM, Thomas Weise wrote: > > > Hi, > > > > This topic has come up on several PRs and I think it warrants a broader > > discussion. > > > > At the time of incubation, the decision was to defer the change of Java > > packages from com.datatorrent to org.apache.apex until the next major release, to > > ensure backward compatibility for users. > > > > Unfortunately that has led to some confusion, as contributors continue to > > add new code under legacy packages. > > > > It is also a wider issue that examples for using Apex continue to refer to > > com.datatorrent packages, nearly one year after graduation. 
More and more > > user code is being built on top of something that needs to change, the > can > > is being kicked down the road and users will face more changes later. > > > > I would like to propose the following: > > > > 1. All new code has to be submitted under org.apache.apex packages > > > > 2. Not all code is under backward compatibility restriction and in those > > cases we can rename the packages right away. Examples: buffer server, > > engine, demos/examples, benchmarks > > > > 3. Discuss when the core API and operators can be changed. For operators > we > > have a bit more freedom to do changes before a major release as most of > > them are marked @Evolving and users have the ability to continue using > > prior version of Malhar with newer engine due to engine backward > > compatibility guarantee. > > > > Thanks, > > Thomas > > >
Re: Proposal: CompositeAccumulation for Windowed Operator
A JIRA has been created: https://issues.apache.org/jira/browse/APEXMALHAR-2428 On Mon, Feb 27, 2017 at 9:53 AM, Bright Chen wrote: > I think Chinmay's proposal could make applications clearer and improve > performance, since locating the key/window costs most of the time. > > A suggested usage for CompositeAccumulation could be as follows: > > *//following is sample code showing how to add sub-accumulations* > > *CompositeAccumulation accumulations = new > CompositeAccumulation<>();* > > *AccumulationTag sumTag = > accumulations.addAccumulation((Accumulation)new SumAccumulation());* > > *AccumulationTag countTag = > accumulations.addAccumulation((Accumulation)new Count());* > > *AccumulationTag maxTag = accumulations.addAccumulation(new Max());* > > *AccumulationTag minTag = accumulations.addAccumulation(new Min());* > > *//following is a sample showing how to get the sub-accumulation output* > > *accumulations.getSubOutput(sumTag, outputValues)* > > *accumulations.getSubOutput(countTag, outputValues)* > > *accumulations.getSubOutput(maxTag, outputValues)* > > *accumulations.getSubOutput(minTag, outputValues)* > > > Thanks > > Bright > > On Sun, Feb 26, 2017 at 10:33 PM, Chinmay Kolhatkar < > chin...@datatorrent.com> wrote: > >> Dear Community, >> >> Currently we have accumulations for individual types of accumulations, >> but if one wants to do more than one accumulation in a single stage of >> the Windowed Operator, it is not possible. >> >> I want to propose an idea called "CompositeAccumulation", where more than >> one >> accumulation can be configured, and this accumulation can rely on >> multiple >> accumulations to generate the final result/output. >> >> The output can take either of two forms: >> 1. Just the list of outputs, with AccumulationTags as identifiers. >> 2. The results of multiple accumulations merged using some user-defined >> logic. >> For eg. 
In aggregation case, Input POJO to this accumulation can be a >> POJO containing NumberOfOrders as field and in output one might need to >> generate a final(single) POJO which contains result of multiple >> accumulations like SUM, COUNT on NumberOfOrders as different fields of >> outgoing POJO. >> >> I particularly see the use of this for Multiple Aggregation which we would >> like to do in SQL on Apex Integration. >> >> Please share your thoughts on the same. >> >> Thanks, >> Chinmay. >> > >
[jira] [Created] (APEXMALHAR-2428) CompositeAccumulation for windowed operator
bright chen created APEXMALHAR-2428: --- Summary: CompositeAccumulation for windowed operator Key: APEXMALHAR-2428 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2428 Project: Apache Apex Malhar Issue Type: New Feature Reporter: bright chen Assignee: bright chen
Re: Proposal: CompositeAccumulation for Windowed Operator
I think Chinmay's proposal could make applications clearer and improve performance, since locating the key/window costs most of the time. A suggested usage for CompositeAccumulation could be as follows: *//following is sample code showing how to add sub-accumulations* *CompositeAccumulation accumulations = new CompositeAccumulation<>();* *AccumulationTag sumTag = accumulations.addAccumulation((Accumulation)new SumAccumulation());* *AccumulationTag countTag = accumulations.addAccumulation((Accumulation)new Count());* *AccumulationTag maxTag = accumulations.addAccumulation(new Max());* *AccumulationTag minTag = accumulations.addAccumulation(new Min());* *//following is a sample showing how to get the sub-accumulation output* *accumulations.getSubOutput(sumTag, outputValues)* *accumulations.getSubOutput(countTag, outputValues)* *accumulations.getSubOutput(maxTag, outputValues)* *accumulations.getSubOutput(minTag, outputValues)* Thanks Bright On Sun, Feb 26, 2017 at 10:33 PM, Chinmay Kolhatkar wrote: > Dear Community, > > Currently we have accumulations for individual types of accumulations, > but if one wants to do more than one accumulation in a single stage of > the Windowed Operator, it is not possible. > > I want to propose an idea called "CompositeAccumulation", where more than one > accumulation can be configured, and this accumulation can rely on multiple > accumulations to generate the final result/output. > > The output can take either of two forms: > 1. Just the list of outputs, with AccumulationTags as identifiers. > 2. The results of multiple accumulations merged using some user-defined > logic. > For eg. In aggregation case, the input POJO to this accumulation can be a > POJO containing NumberOfOrders as a field, and in the output one might need to > generate a final (single) POJO which contains the results of multiple > accumulations, like SUM and COUNT on NumberOfOrders, as different fields of the > outgoing POJO. 
>
> I particularly see the use of this for multiple aggregations, which we
> would like to do in the SQL on Apex integration.
>
> Please share your thoughts on the same.
>
> Thanks,
> Chinmay.
>
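To make the proposed API concrete, here is a minimal, self-contained sketch of how such a composite could delegate to its sub-accumulations. The class and method names (CompositeAccumulation, AccumulationTag, addAccumulation, getSubOutput) follow the proposal above, but the implementation details are assumptions for illustration, not the actual Malhar code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed CompositeAccumulation. The composite
// keeps one state slot per registered sub-accumulation, so the costly
// key/window lookup happens once per tuple instead of once per accumulation.
public class CompositeSketch {

  // Simplified stand-in for Malhar's Accumulation interface.
  interface Accumulation<InputT, AccumT> {
    AccumT defaultAccumulatedValue();
    AccumT accumulate(AccumT accumulated, InputT input);
  }

  // Opaque handle identifying one sub-accumulation inside the composite.
  static class AccumulationTag {
    final int index;
    AccumulationTag(int index) { this.index = index; }
  }

  static class CompositeAccumulation<InputT>
      implements Accumulation<InputT, List<Object>> {
    private final List<Accumulation<InputT, Object>> subs = new ArrayList<>();

    @SuppressWarnings("unchecked")
    AccumulationTag addAccumulation(Accumulation<InputT, ?> sub) {
      subs.add((Accumulation<InputT, Object>) sub);
      return new AccumulationTag(subs.size() - 1);
    }

    public List<Object> defaultAccumulatedValue() {
      List<Object> states = new ArrayList<>();
      for (Accumulation<InputT, Object> sub : subs) {
        states.add(sub.defaultAccumulatedValue());
      }
      return states;
    }

    public List<Object> accumulate(List<Object> states, InputT input) {
      // Fan the tuple out to every registered sub-accumulation.
      for (int i = 0; i < subs.size(); i++) {
        states.set(i, subs.get(i).accumulate(states.get(i), input));
      }
      return states;
    }

    // Extract one sub-accumulation's result from the composite output.
    Object getSubOutput(AccumulationTag tag, List<Object> outputValues) {
      return outputValues.get(tag.index);
    }
  }

  // Accumulate 3, 5, 7 with a sum and a count sub-accumulation.
  static long[] demo() {
    CompositeAccumulation<Long> acc = new CompositeAccumulation<>();
    AccumulationTag sumTag = acc.addAccumulation(new Accumulation<Long, Long>() {
      public Long defaultAccumulatedValue() { return 0L; }
      public Long accumulate(Long a, Long in) { return a + in; }
    });
    AccumulationTag countTag = acc.addAccumulation(new Accumulation<Long, Long>() {
      public Long defaultAccumulatedValue() { return 0L; }
      public Long accumulate(Long a, Long in) { return a + 1; }
    });
    List<Object> state = acc.defaultAccumulatedValue();
    for (long v : new long[] {3, 5, 7}) {
      state = acc.accumulate(state, v);
    }
    return new long[] {(Long) acc.getSubOutput(sumTag, state),
                       (Long) acc.getSubOutput(countTag, state)};
  }

  public static void main(String[] args) {
    long[] r = demo();
    System.out.println("sum=" + r[0] + " count=" + r[1]); // sum=15 count=3
  }
}
```

The point of the design is that the windowed operator locates the per-key/per-window state once per tuple and the composite fans the tuple out to all sub-accumulations, rather than paying that lookup once per accumulation.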
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
On Mon, Feb 27, 2017 at 8:50 AM, Bhupesh Chawda wrote: > I think my comments related to count based windows might be causing > confusion. Let's not discuss count based scenarios for now. > > Just want to make sure we are on the same page wrt. the "each file is a > batch" use case. As mentioned by Thomas, each tuple from the same file > has the same timestamp (which is just a sequence number) and that helps > keep tuples from each file in a separate window. > Yes, in this case it is a sequence number, but it could be a time stamp also, depending on the file naming convention. And if it was event time processing, the watermark would be derived from records within the file. Agreed, the source should have a mechanism to control the time stamp extraction along with everything else pertaining to the watermark generation. > We could also implement a "timestampExtractor" interface to identify the > timestamp (sequence number) for a file. > > ~ Bhupesh > > > ___ > > Bhupesh Chawda > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > www.datatorrent.com | apex.apache.org > > > > On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise wrote: > > > I don't think this is a use case for count based window. > > > > We have multiple files that are retrieved in a sequence and there is no > > knowledge of the number of records per file. The requirement is to > > aggregate each file separately and emit the aggregate when the file is read > > fully. There is no concept of "end of something" for an individual key > and > > global window isn't applicable. > > > > However, as already explained and implemented by Bhupesh, this can be > > solved using watermark and window (in this case the window timestamp isn't > > a timestamp, but a file sequence, but that doesn't matter). > > > > Thomas > > > > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan wrote: > > > I don't think this is the way to go.
Global Window only means the > > timestamp > > > does not matter (or that there is no timestamp). It does not > necessarily > > > mean it's a large batch. Unless there is some notion of event time for > > each > > > file, you don't want to embed the file into the window itself. > > > > > > If you want the result broken up by file name, and if the files are to > be > > > processed in parallel, I think making the file name be part of the key > is > > > the way to go. I think it's very confusing if we somehow make the file > to > > > be part of the window. > > > > > > For count-based window, it's not implemented yet and you're welcome to > > add > > > that feature. In case of count-based windows, there would be no notion > of > > > time and you probably only trigger at the end of each window. In the > case > > > of count-based windows, the watermark only matters for batch since you > > need > > > a way to know when the batch has ended (if the count is 10, the number > of > > > tuples in the batch is let's say 105, you need a way to end the last > > window > > > with 5 tuples). > > > > > > David > > > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda < > bhup...@datatorrent.com > > > > > > wrote: > > > > > > > Hi David, > > > > > > > > Thanks for your comments. > > > > > > > > The wordcount example that I created based on the windowed operator > > does > > > > processing of word counts per file (each file as a separate batch), > > i.e. > > > > process counts for each file and dump into separate files. > > > > As I understand Global window is for one large batch; i.e. all > incoming > > > > data falls into the same batch. This could not be processed using > > > > GlobalWindow option as we need more than one windows. In this case, I > > > > configured the windowed operator to have time windows of 1ms each and > > > > passed data for each file with increasing timestamps: (file1, 1), > > (file2, > > > > 2) and so on. Is there a better way of handling this scenario? 
> > > > > > > > Regarding (2 - count based windows), I think there is a trigger > option > > to > > > > process count based windows. In case I want to process every 1000 > > tuples > > > as > > > > a batch, I could set the Trigger option to CountTrigger with the > > > > accumulation set to Discarding. Is this correct? > > > > > > > > I agree that (4. Final Watermark) can be done using Global window. > > > > > > > > ~ Bhupesh > > > > > > > > ___ > > > > > > > > Bhupesh Chawda > > > > > > > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > > > > > > > www.datatorrent.com | apex.apache.org > > > > > > > > > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan > > wrote: > > > > > > > > > I'm worried that we are making the watermark concept too > complicated. > > > > > > > > > > Watermarks should simply just tell you what windows can be > considered > > > > > complete. > > > >
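The count-based triggering that Bhupesh describes (fire every N tuples with a discarding accumulation) can be sketched independently of the Malhar API. The following is a hypothetical illustration of the semantics, not the actual CountTrigger implementation; the class name and callback style are assumptions:

```java
import java.util.function.Consumer;

// Hypothetical sketch of count-based triggering with a discarding
// accumulation: fire an aggregate every triggerCount tuples, then reset
// the accumulated state. Illustrative only, not Malhar's Trigger API.
public class CountTriggerSketch {
  private final int triggerCount;
  private final Consumer<Long> onTrigger;
  private long sum = 0;
  private int seen = 0;

  CountTriggerSketch(int triggerCount, Consumer<Long> onTrigger) {
    this.triggerCount = triggerCount;
    this.onTrigger = onTrigger;
  }

  void process(long tuple) {
    sum += tuple;
    if (++seen == triggerCount) {
      onTrigger.accept(sum); // fire the pane
      sum = 0;               // Discarding: state is reset after each fire
      seen = 0;
    }
  }

  public static void main(String[] args) {
    CountTriggerSketch t =
        new CountTriggerSketch(3, s -> System.out.println("pane sum = " + s));
    for (long v = 1; v <= 7; v++) {
      t.process(v);
    }
    // tuples 1..3 fire 6, tuples 4..6 fire 15; tuple 7 is still pending
  }
}
```

This also shows David's caveat: a final pane smaller than the trigger count (here, tuple 7) never reaches the count on its own, which is why some end-of-batch signal such as a final watermark is needed to flush it.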
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
I think my comments related to count based windows might be causing confusion. Let's not discuss count based scenarios for now. Just want to make sure we are on the same page wrt. the "each file is a batch" use case. As mentioned by Thomas, each tuple from the same file has the same timestamp (which is just a sequence number) and that helps keep tuples from each file in a separate window. We could also implement a "timestampExtractor" interface to identify the timestamp (sequence number) for a file. ~ Bhupesh ___ Bhupesh Chawda E: bhup...@datatorrent.com | Twitter: @bhupeshsc www.datatorrent.com | apex.apache.org On Mon, Feb 27, 2017 at 9:52 PM, Thomas Weise wrote: > I don't think this is a use case for count based window. > > We have multiple files that are retrieved in a sequence and there is no > knowledge of the number of records per file. The requirement is to > aggregate each file separately and emit the aggregate when the file is read > fully. There is no concept of "end of something" for an individual key and > global window isn't applicable. > > However, as already explained and implemented by Bhupesh, this can be > solved using watermark and window (in this case the window timestamp isn't > a timestamp, but a file sequence, but that doesn't matter). > > Thomas > > > On Mon, Feb 27, 2017 at 8:05 AM, David Yan wrote: > > > I don't think this is the way to go. Global Window only means the > timestamp > > does not matter (or that there is no timestamp). It does not necessarily > > mean it's a large batch. Unless there is some notion of event time for > each > > file, you don't want to embed the file into the window itself. > > > > If you want the result broken up by file name, and if the files are to be > > processed in parallel, I think making the file name be part of the key is > > the way to go. I think it's very confusing if we somehow make the file to > > be part of the window.
> > > > For count-based window, it's not implemented yet and you're welcome to > add > > that feature. In case of count-based windows, there would be no notion of > > time and you probably only trigger at the end of each window. In the case > > of count-based windows, the watermark only matters for batch since you > need > > a way to know when the batch has ended (if the count is 10, the number of > > tuples in the batch is let's say 105, you need a way to end the last > window > > with 5 tuples). > > > > David > > > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda > > > wrote: > > > > > Hi David, > > > > > > Thanks for your comments. > > > > > > The wordcount example that I created based on the windowed operator > does > > > processing of word counts per file (each file as a separate batch), > i.e. > > > process counts for each file and dump into separate files. > > > As I understand Global window is for one large batch; i.e. all incoming > > > data falls into the same batch. This could not be processed using > > > GlobalWindow option as we need more than one windows. In this case, I > > > configured the windowed operator to have time windows of 1ms each and > > > passed data for each file with increasing timestamps: (file1, 1), > (file2, > > > 2) and so on. Is there a better way of handling this scenario? > > > > > > Regarding (2 - count based windows), I think there is a trigger option > to > > > process count based windows. In case I want to process every 1000 > tuples > > as > > > a batch, I could set the Trigger option to CountTrigger with the > > > accumulation set to Discarding. Is this correct? > > > > > > I agree that (4. Final Watermark) can be done using Global window. 
> > > > > > ~ Bhupesh > > > > > > ___ > > > > > > Bhupesh Chawda > > > > > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > > > > > www.datatorrent.com | apex.apache.org > > > > > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan > wrote: > > > > > > > I'm worried that we are making the watermark concept too complicated. > > > > > > > > Watermarks should simply just tell you what windows can be considered > > > > complete. > > > > > > > > Point 2 is basically a count-based window. Watermarks do not play a > > role > > > > here because the window is always complete at the n-th tuple. > > > > > > > > If I understand correctly, point 3 is for batch processing of files. > > > Unless > > > > the files contain timed events, it sounds to be that this can be > > achieved > > > > with just a Global Window. For signaling EOF, a watermark with a > > > +infinity > > > > timestamp can be used so that triggers will be fired upon receipt of > > that > > > > watermark. > > > > > > > > For point 4, just like what I mentioned above, can be achieved with a > > > > watermark with a +infinity timestamp. > > > > > > > > David > > > > > > > > >
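The "timestampExtractor" idea mentioned above could look something like the following. The interface name comes from the discussion; the file-naming convention (a trailing sequence number in the file name) is an assumption for illustration, not an actual Malhar interface:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a "timestampExtractor": map each file name to a
// monotonically increasing sequence number that the windowed operator can
// treat as the tuple timestamp, keeping every file in its own window.
public class FileSequenceExtractor {
  interface TimestampExtractor {
    long extractTimestamp(String fileName);
  }

  // Assumes file names carry a trailing number, e.g. "input-7.txt" -> 7.
  static final TimestampExtractor TRAILING_NUMBER = fileName -> {
    Matcher m = Pattern.compile("(\\d+)(?=\\.[^.]*$|$)").matcher(fileName);
    if (!m.find()) {
      throw new IllegalArgumentException("no sequence number in " + fileName);
    }
    return Long.parseLong(m.group(1));
  };

  public static void main(String[] args) {
    System.out.println(TRAILING_NUMBER.extractTimestamp("input-7.txt")); // 7
    System.out.println(TRAILING_NUMBER.extractTimestamp("file2"));       // 2
  }
}
```

As noted in the thread, the same hook could return a real event-time timestamp instead of a sequence number when the file naming convention carries one.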
Re: Java packages: legacy -> org.apache.apex
Thomas, I agree with you that we need this migration to be done, but I have a different opinion on how to execute it. I think if we do this in phases as described above, users might end up even more confused.

For doing this migration, I think it should follow these steps:
1. Whether for the operator library or core components, we should announce widely on the dev and users mailing lists that "...such change is going to happen in the next release".
2. Take up the work all at once and do not phase it.

Thanks,
Chinmay.

On Mon, Feb 27, 2017 at 9:09 PM, Thomas Weise wrote:
> Hi,
>
> This topic has come up on several PRs and I think it warrants a broader
> discussion.
>
> At the time of incubation, the decision was to defer change of Java
> packages from com.datatorrent to org.apache.apex till next major release to
> ensure backward compatibility for users.
>
> Unfortunately that has led to some confusion, as contributors continue to
> add new code under legacy packages.
>
> It is also a wider issue that examples for using Apex continue to refer to
> com.datatorrent packages, nearly one year after graduation. More and more
> user code is being built on top of something that needs to change, the can
> is being kicked down the road and users will face more changes later.
>
> I would like to propose the following:
>
> 1. All new code has to be submitted under org.apache.apex packages
>
> 2. Not all code is under backward compatibility restriction and in those
> cases we can rename the packages right away. Examples: buffer server,
> engine, demos/examples, benchmarks
>
> 3. Discuss when the core API and operators can be changed. For operators we
> have a bit more freedom to do changes before a major release as most of
> them are marked @Evolving and users have the ability to continue using
> prior version of Malhar with newer engine due to engine backward
> compatibility guarantee.
>
> Thanks,
> Thomas
>
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
I don't think this is a use case for count based window. We have multiple files that are retrieved in a sequence and there is no knowledge of the number of records per file. The requirement is to aggregate each file separately and emit the aggregate when the file is read fully. There is no concept of "end of something" for an individual key and global window isn't applicable. However, as already explained and implemented by Bhupesh, this can be solved using watermark and window (in this case the window timestamp isn't a timestamp, but a file sequence, but that doesn't matter). Thomas On Mon, Feb 27, 2017 at 8:05 AM, David Yan wrote: > I don't think this is the way to go. Global Window only means the timestamp > does not matter (or that there is no timestamp). It does not necessarily > mean it's a large batch. Unless there is some notion of event time for each > file, you don't want to embed the file into the window itself. > > If you want the result broken up by file name, and if the files are to be > processed in parallel, I think making the file name be part of the key is > the way to go. I think it's very confusing if we somehow make the file to > be part of the window. > > For count-based window, it's not implemented yet and you're welcome to add > that feature. In case of count-based windows, there would be no notion of > time and you probably only trigger at the end of each window. In the case > of count-based windows, the watermark only matters for batch since you need > a way to know when the batch has ended (if the count is 10, the number of > tuples in the batch is let's say 105, you need a way to end the last window > with 5 tuples). > > David > > On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda > > wrote: > > > Hi David, > > > > Thanks for your comments. > > > > The wordcount example that I created based on the windowed operator does > > processing of word counts per file (each file as a separate batch), i.e.
> > process counts for each file and dump into separate files. > > As I understand Global window is for one large batch; i.e. all incoming > > data falls into the same batch. This could not be processed using > > GlobalWindow option as we need more than one windows. In this case, I > > configured the windowed operator to have time windows of 1ms each and > > passed data for each file with increasing timestamps: (file1, 1), (file2, > > 2) and so on. Is there a better way of handling this scenario? > > > > Regarding (2 - count based windows), I think there is a trigger option to > > process count based windows. In case I want to process every 1000 tuples > as > > a batch, I could set the Trigger option to CountTrigger with the > > accumulation set to Discarding. Is this correct? > > > > I agree that (4. Final Watermark) can be done using Global window. > > > > ~ Bhupesh > > > > ___ > > > > Bhupesh Chawda > > > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > > > www.datatorrent.com | apex.apache.org > > > > > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan wrote: > > > > > I'm worried that we are making the watermark concept too complicated. > > > > > > Watermarks should simply just tell you what windows can be considered > > > complete. > > > > > > Point 2 is basically a count-based window. Watermarks do not play a > role > > > here because the window is always complete at the n-th tuple. > > > > > > If I understand correctly, point 3 is for batch processing of files. > > Unless > > > the files contain timed events, it sounds to be that this can be > achieved > > > with just a Global Window. For signaling EOF, a watermark with a > > +infinity > > > timestamp can be used so that triggers will be fired upon receipt of > that > > > watermark. > > > > > > For point 4, just like what I mentioned above, can be achieved with a > > > watermark with a +infinity timestamp. 
> > > > > > David > > > > > > > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda < > bhup...@datatorrent.com > > > > > > wrote: > > > > > > > Hi Thomas, > > > > > > > > For an input operator which is supposed to generate watermarks for > > > > downstream operators, I can think about the following watermarks that > > the > > > > operator can emit: > > > > 1. Time based watermarks (the high watermark / low watermark) > > > > 2. Number of tuple based watermarks (Every n tuples) > > > > 3. File based watermarks (Start file, end file) > > > > 4. Final watermark > > > > > > > > File based watermarks seem to be applicable for batch (file based) as > > > well, > > > > and hence I thought of looking at these first. Does this seem to be > in > > > line > > > > with the thought process? > > > > > > > > ~ Bhupesh > > > > > > > > > > > > > > > > ___ > > > > > > > > Bhupesh Chawda > > > > > > > > Software Engineer > > > > > > > > E:
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
I don't think this is the way to go. Global Window only means the timestamp does not matter (or that there is no timestamp). It does not necessarily mean it's a large batch. Unless there is some notion of event time for each file, you don't want to embed the file into the window itself. If you want the result broken up by file name, and if the files are to be processed in parallel, I think making the file name be part of the key is the way to go. I think it's very confusing if we somehow make the file to be part of the window. For count-based window, it's not implemented yet and you're welcome to add that feature. In case of count-based windows, there would be no notion of time and you probably only trigger at the end of each window. In the case of count-based windows, the watermark only matters for batch since you need a way to know when the batch has ended (if the count is 10, the number of tuples in the batch is let's say 105, you need a way to end the last window with 5 tuples). David On Mon, Feb 27, 2017 at 2:41 AM, Bhupesh Chawda wrote: > Hi David, > > Thanks for your comments. > > The wordcount example that I created based on the windowed operator does > processing of word counts per file (each file as a separate batch), i.e. > process counts for each file and dump into separate files. > As I understand Global window is for one large batch; i.e. all incoming > data falls into the same batch. This could not be processed using > GlobalWindow option as we need more than one window. In this case, I > configured the windowed operator to have time windows of 1ms each and > passed data for each file with increasing timestamps: (file1, 1), (file2, > 2) and so on. Is there a better way of handling this scenario? > > Regarding (2 - count based windows), I think there is a trigger option to > process count based windows. In case I want to process every 1000 tuples as > a batch, I could set the Trigger option to CountTrigger with the > accumulation set to Discarding.
Is this correct? > > I agree that (4. Final Watermark) can be done using Global window. > > ~ Bhupesh > > ___ > > Bhupesh Chawda > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > www.datatorrent.com | apex.apache.org > > > > On Mon, Feb 27, 2017 at 12:18 PM, David Yan wrote: > > > I'm worried that we are making the watermark concept too complicated. > > > > Watermarks should simply just tell you what windows can be considered > > complete. > > > > Point 2 is basically a count-based window. Watermarks do not play a role > > here because the window is always complete at the n-th tuple. > > > > If I understand correctly, point 3 is for batch processing of files. > Unless > > the files contain timed events, it sounds to be that this can be achieved > > with just a Global Window. For signaling EOF, a watermark with a > +infinity > > timestamp can be used so that triggers will be fired upon receipt of that > > watermark. > > > > For point 4, just like what I mentioned above, can be achieved with a > > watermark with a +infinity timestamp. > > > > David > > > > > > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda > > > wrote: > > > > > Hi Thomas, > > > > > > For an input operator which is supposed to generate watermarks for > > > downstream operators, I can think about the following watermarks that > the > > > operator can emit: > > > 1. Time based watermarks (the high watermark / low watermark) > > > 2. Number of tuple based watermarks (Every n tuples) > > > 3. File based watermarks (Start file, end file) > > > 4. Final watermark > > > > > > File based watermarks seem to be applicable for batch (file based) as > > well, > > > and hence I thought of looking at these first. Does this seem to be in > > line > > > with the thought process? 
> > > > > > ~ Bhupesh > > > > > > > > > > > > ___ > > > > > > Bhupesh Chawda > > > > > > Software Engineer > > > > > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > > > > > www.datatorrent.com | apex.apache.org > > > > > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise wrote: > > > > > > > I don't think this should be designed based on a simplistic file > > > > input-output scenario. It would be good to include a stateful > > > > transformation based on event time. > > > > > > > > More complex pipelines contain stateful transformations that depend > on > > > > windowing and watermarks. I think we need a watermark concept that is > > > based > > > > on progress in event time (or other monotonic increasing sequence) > that > > > > other operators can generically work with. > > > > > > > > Note that even file input in many cases can produce time based > > > watermarks, > > > > for example when you read part files that are bound by event time. > > > > > > > > Thanks, > > > > Thomas > > > >
Java packages: legacy -> org.apache.apex
Hi,

This topic has come up on several PRs and I think it warrants a broader discussion.

At the time of incubation, the decision was to defer the change of Java packages from com.datatorrent to org.apache.apex till the next major release to ensure backward compatibility for users. Unfortunately that has led to some confusion, as contributors continue to add new code under legacy packages.

It is also a wider issue that examples for using Apex continue to refer to com.datatorrent packages, nearly one year after graduation. More and more user code is being built on top of something that needs to change; the can is being kicked down the road, and users will face more changes later.

I would like to propose the following:

1. All new code has to be submitted under org.apache.apex packages.

2. Not all code is under the backward compatibility restriction, and in those cases we can rename the packages right away. Examples: buffer server, engine, demos/examples, benchmarks.

3. Discuss when the core API and operators can be changed. For operators we have a bit more freedom to do changes before a major release, as most of them are marked @Evolving and users have the ability to continue using a prior version of Malhar with a newer engine due to the engine backward compatibility guarantee.

Thanks,
Thomas
[jira] [Resolved] (APEXMALHAR-2424) Null pointer exception in JDBCPojoPollInputOperator is thrown when we set columnsExpression and have additional columns in fieldInfos
[ https://issues.apache.org/jira/browse/APEXMALHAR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shubham pathak resolved APEXMALHAR-2424. Resolution: Fixed > Null pointer exception in JDBCPojoPollInputOperator is thrown when we set > columnsExpression and have additional columns in fieldInfos > - > > Key: APEXMALHAR-2424 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2424 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Hitesh Kapoor >Assignee: Hitesh Kapoor > Fix For: 3.7.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (APEXMALHAR-2424) Null pointer exception in JDBCPojoPollInputOperator is thrown when we set columnsExpression and have additional columns in fieldInfos
[ https://issues.apache.org/jira/browse/APEXMALHAR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shubham pathak updated APEXMALHAR-2424: --- Fix Version/s: 3.7.0 > Null pointer exception in JDBCPojoPollInputOperator is thrown when we set > columnsExpression and have additional columns in fieldInfos > - > > Key: APEXMALHAR-2424 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2424 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Hitesh Kapoor >Assignee: Hitesh Kapoor > Fix For: 3.7.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2424) Null pointer exception in JDBCPojoPollInputOperator is thrown when we set columnsExpression and have additional columns in fieldInfos
[ https://issues.apache.org/jira/browse/APEXMALHAR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885722#comment-15885722 ] ASF GitHub Bot commented on APEXMALHAR-2424: Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/562 > Null pointer exception in JDBCPojoPollInputOperator is thrown when we set > columnsExpression and have additional columns in fieldInfos > - > > Key: APEXMALHAR-2424 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2424 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Hitesh Kapoor >Assignee: Hitesh Kapoor > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[GitHub] apex-malhar pull request #562: APEXMALHAR-2424 Extra null field getting adde...
Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/562 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Resolved] (APEXMALHAR-2415) Enable PojoInnerJoin accum to allow multiple keys for join purpose
[ https://issues.apache.org/jira/browse/APEXMALHAR-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tushar Gosavi resolved APEXMALHAR-2415. --- Resolution: Fixed Fix Version/s: 3.7.0 > Enable PojoInnerJoin accum to allow multiple keys for join purpose > -- > > Key: APEXMALHAR-2415 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2415 > Project: Apache Apex Malhar > Issue Type: Sub-task > Components: algorithms >Affects Versions: 3.6.0 >Reporter: Chinmay Kolhatkar >Assignee: Hitesh Kapoor > Fix For: 3.7.0 > > > Enable PojoInnerJoin accum to allow multiple keys for join purpose -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2415) Enable PojoInnerJoin accum to allow multiple keys for join purpose
[ https://issues.apache.org/jira/browse/APEXMALHAR-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885663#comment-15885663 ] ASF GitHub Bot commented on APEXMALHAR-2415: Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/561 > Enable PojoInnerJoin accum to allow multiple keys for join purpose > -- > > Key: APEXMALHAR-2415 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2415 > Project: Apache Apex Malhar > Issue Type: Sub-task > Components: algorithms >Affects Versions: 3.6.0 >Reporter: Chinmay Kolhatkar >Assignee: Hitesh Kapoor > > Enable PojoInnerJoin accum to allow multiple keys for join purpose -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[GitHub] apex-malhar pull request #561: APEXMALHAR-2415 Taking join on multiple colum...
Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/561
Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
Hi David, Thanks for your comments. The wordcount example that I created based on the windowed operator does processing of word counts per file (each file as a separate batch), i.e. process counts for each file and dump into separate files. As I understand, Global window is for one large batch; i.e. all incoming data falls into the same batch. This could not be processed using the GlobalWindow option as we need more than one window. In this case, I configured the windowed operator to have time windows of 1ms each and passed data for each file with increasing timestamps: (file1, 1), (file2, 2) and so on. Is there a better way of handling this scenario? Regarding (2 - count based windows), I think there is a trigger option to process count based windows. In case I want to process every 1000 tuples as a batch, I could set the Trigger option to CountTrigger with the accumulation set to Discarding. Is this correct? I agree that (4. Final Watermark) can be done using Global window. ~ Bhupesh ___ Bhupesh Chawda E: bhup...@datatorrent.com | Twitter: @bhupeshsc www.datatorrent.com | apex.apache.org On Mon, Feb 27, 2017 at 12:18 PM, David Yan wrote: > I'm worried that we are making the watermark concept too complicated. > > Watermarks should simply just tell you what windows can be considered > complete. > > Point 2 is basically a count-based window. Watermarks do not play a role > here because the window is always complete at the n-th tuple. > > If I understand correctly, point 3 is for batch processing of files. Unless > the files contain timed events, it sounds to me that this can be achieved > with just a Global Window. For signaling EOF, a watermark with a +infinity > timestamp can be used so that triggers will be fired upon receipt of that > watermark. > > For point 4, just like what I mentioned above, this can be achieved with a > watermark with a +infinity > timestamp.
> > David > > > > > On Sat, Feb 18, 2017 at 8:04 AM, Bhupesh Chawda > wrote: > > > Hi Thomas, > > > > For an input operator which is supposed to generate watermarks for > > downstream operators, I can think about the following watermarks that the > > operator can emit: > > 1. Time based watermarks (the high watermark / low watermark) > > 2. Number of tuple based watermarks (Every n tuples) > > 3. File based watermarks (Start file, end file) > > 4. Final watermark > > > > File based watermarks seem to be applicable for batch (file based) as > well, > > and hence I thought of looking at these first. Does this seem to be in > line > > with the thought process? > > > > ~ Bhupesh > > > > > > > > ___ > > > > Bhupesh Chawda > > > > Software Engineer > > > > E: bhup...@datatorrent.com | Twitter: @bhupeshsc > > > > www.datatorrent.com | apex.apache.org > > > > > > > > On Thu, Feb 16, 2017 at 10:37 AM, Thomas Weise wrote: > > > > > I don't think this should be designed based on a simplistic file > > > input-output scenario. It would be good to include a stateful > > > transformation based on event time. > > > > > > More complex pipelines contain stateful transformations that depend on > > > windowing and watermarks. I think we need a watermark concept that is > > based > > > on progress in event time (or other monotonic increasing sequence) that > > > other operators can generically work with. > > > > > > Note that even file input in many cases can produce time based > > watermarks, > > > for example when you read part files that are bound by event time. > > > > > > Thanks, > > > Thomas > > > > > > > > > On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda < > bhup...@datatorrent.com > > > > > > wrote: > > > > > > > For better understanding the use case for control tuples in batch, I > > am > > > > creating a prototype for a batch application using File Input and > File > > > > Output operators. 
> > > > > > > > To enable basic batch processing for File IO operators, I am > proposing > > > the > > > > following changes to File input and output operators: > > > > 1. File Input operator emits a watermark each time it opens and > closes > > a > > > > file. These can be "start file" and "end file" watermarks which > include > > > the > > > > corresponding file names. The "start file" tuple should be sent > before > > > any > > > > of the data from that file flows. > > > > 2. File Input operator can be configured to end the application > after a > > > > single or n scans of the directory (a batch). This is where the > > operator > > > > emits the final watermark (the end of application control tuple). > This > > > will > > > > also shutdown the application. > > > > 3. The File output operator handles these control tuples. "Start > file" > > > > initializes the file name for the incoming tuples. "End file" > watermark > > > > forces a finalize on that file. > > > > > > > > The user would
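The proposed "start file" / "end file" control tuples can be sketched as follows. The class names and the writer's reactions are illustrative assumptions, not the actual Malhar File I/O operator API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed control tuples: the input side brackets
// each file's data with "start file" / "end file" watermarks, and the output
// side uses them to open and finalize the corresponding output file.
public class FileWatermarkSketch {
  static abstract class ControlTuple {
    final String fileName;
    ControlTuple(String fileName) { this.fileName = fileName; }
  }
  static class StartFile extends ControlTuple { StartFile(String f) { super(f); } }
  static class EndFile extends ControlTuple { EndFile(String f) { super(f); } }

  // Downstream writer reacting to the control tuples: "start file" sets the
  // output file name for incoming tuples, "end file" forces a finalize.
  static class Writer {
    final List<String> log = new ArrayList<>();
    private String current;

    void control(ControlTuple t) {
      if (t instanceof StartFile) {
        current = t.fileName;
        log.add("open " + current);
      } else {
        log.add("finalize " + current);
        current = null;
      }
    }

    void data(String line) {
      log.add(current + " << " + line);
    }
  }

  public static void main(String[] args) {
    Writer w = new Writer();
    w.control(new StartFile("file1"));
    w.data("hello");
    w.data("world");
    w.control(new EndFile("file1"));
    w.log.forEach(System.out::println);
  }
}
```

The key ordering constraint from the proposal is preserved here: the "start file" tuple arrives before any data from that file, so the writer always knows which output file an incoming tuple belongs to.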
[jira] [Resolved] (APEXMALHAR-2414) Improve performance of PojoInnerJoin accum by using PojoUtils
[ https://issues.apache.org/jira/browse/APEXMALHAR-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tushar Gosavi resolved APEXMALHAR-2414.
---------------------------------------
    Resolution: Fixed
    Fix Version/s: 3.7.0

Change pushed through dd5341f222b8141b76ecbc5ea9653c70b3c78c44

> Improve performance of PojoInnerJoin accum by using PojoUtils
> -------------------------------------------------------------
>
>          Key: APEXMALHAR-2414
>          URL: https://issues.apache.org/jira/browse/APEXMALHAR-2414
>      Project: Apache Apex Malhar
>   Issue Type: Sub-task
>   Components: algorithms
>     Reporter: Chinmay Kolhatkar
>     Assignee: Hitesh Kapoor
>      Fix For: 3.7.0
>
> Currently the PojoInnerJoin accumulation uses Java reflection to deal with
> objects. Use PojoUtils instead.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
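For context on the optimization in APEXMALHAR-2414: the gist of replacing per-tuple reflection is to resolve the field access once and reuse a typed accessor on the hot path. The sketch below only illustrates that idea by hoisting the reflective lookup; the real PojoUtils goes further and generates bytecode for the accessors, which this does not attempt. The class and method names (`GetterSketch`, `compileGetter`) are illustrative assumptions:

```java
import java.lang.reflect.Field;
import java.util.function.Function;

// Contrast between a per-tuple reflective read and a resolve-once accessor.
public class GetterSketch {
    public static class Pojo {
        public String key;
        public Pojo(String key) { this.key = key; }
    }

    // Slow path: field lookup and reflective read on every call.
    static Object reflectiveGet(Object pojo, String fieldName) throws Exception {
        Field f = pojo.getClass().getField(fieldName); // resolved on each tuple
        return f.get(pojo);
    }

    // Faster path: resolve once up front, then each per-tuple call skips the lookup.
    static Function<Pojo, Object> compileGetter(String fieldName) throws Exception {
        Field f = Pojo.class.getField(fieldName);      // resolved once
        return pojo -> {
            try {
                return f.get(pojo);
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        Function<Pojo, Object> getKey = compileGetter("key");
        System.out.println(reflectiveGet(new Pojo("a"), "key")); // a
        System.out.println(getKey.apply(new Pojo("b")));         // b
    }
}
```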
[jira] [Commented] (APEXMALHAR-2426) Add user document for RegexParser operator
[ https://issues.apache.org/jira/browse/APEXMALHAR-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885466#comment-15885466 ]

ASF GitHub Bot commented on APEXMALHAR-2426:
--------------------------------------------

GitHub user venkateshkottapalli opened a pull request:

    https://github.com/apache/apex-malhar/pull/565

    APEXMALHAR-2426 - RegexParser Documentation

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/venkateshkottapalli/apex-malhar APEXMALHAR-2426-RegexParserDocumentation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #565

commit 69adc9a7bef8c15258bd4df8b7e39d5f9871d6cc
Author: venkateshDT
Date:   2017-02-25T00:19:28Z

    RegexParser Documentation

> Add user document for RegexParser operator
> ------------------------------------------
>
>          Key: APEXMALHAR-2426
>          URL: https://issues.apache.org/jira/browse/APEXMALHAR-2426
>      Project: Apache Apex Malhar
>   Issue Type: Task
>     Reporter: Venkatesh Kottapalli
>     Assignee: Venkatesh Kottapalli
>       Labels: documentation
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Add documentation for the RegexParser operator.
> RegexParser JIRA:
> https://issues.apache.org/jira/browse/APEXMALHAR-2218
[GitHub] apex-malhar pull request #565: APEXMALHAR-2426 - RegexParser Documentation
GitHub user venkateshkottapalli opened a pull request:

    https://github.com/apache/apex-malhar/pull/565

    APEXMALHAR-2426 - RegexParser Documentation

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/venkateshkottapalli/apex-malhar APEXMALHAR-2426-RegexParserDocumentation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #565

commit 69adc9a7bef8c15258bd4df8b7e39d5f9871d6cc
Author: venkateshDT
Date:   2017-02-25T00:19:28Z

    RegexParser Documentation

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Resolved] (APEXMALHAR-2218) RegexParser- Operator to parse byte stream using Regex pattern and emit a POJO
[ https://issues.apache.org/jira/browse/APEXMALHAR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venkatesh Kottapalli resolved APEXMALHAR-2218.
----------------------------------------------
    Resolution: Fixed

Merged.

> RegexParser- Operator to parse byte stream using Regex pattern and emit a POJO
> ------------------------------------------------------------------------------
>
>          Key: APEXMALHAR-2218
>          URL: https://issues.apache.org/jira/browse/APEXMALHAR-2218
>      Project: Apache Apex Malhar
>   Issue Type: New Feature
>     Reporter: Venkatesh Kottapalli
>     Assignee: Venkatesh Kottapalli
>      Fix For: 3.7.0
>
> A generic parser which takes a regex pattern as a parameter and parses the
> log file line by line.
> Input: byte array
> Output: POJO
> Parameters: a POJO schema configuration is required, and the order of the
> schema should match the lines from the log file.
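The operator resolved above boils down to applying a configured regex with capture groups to each input line and populating a POJO from the captured groups. A minimal sketch of that idea using `java.util.regex` follows; the class names, the example pattern, and the `LogEntry` fields are illustrative assumptions, not the Malhar operator's API or schema configuration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parse a log line with named capture groups and populate a POJO.
public class RegexToPojoSketch {
    // Example target POJO; in the operator the schema is user-configured.
    public static class LogEntry {
        public String ip;
        public String status;
    }

    // Example pattern: leading IP token, anything in between, trailing 3-digit status.
    private static final Pattern LINE =
        Pattern.compile("^(?<ip>\\S+) .* (?<status>\\d{3})$");

    // Returns a populated POJO, or null if the line does not match the pattern.
    public static LogEntry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        LogEntry e = new LogEntry();
        e.ip = m.group("ip");
        e.status = m.group("status");
        return e;
    }

    public static void main(String[] args) {
        LogEntry e = parse("10.0.0.1 GET /index.html 200");
        System.out.println(e.ip + " " + e.status); // 10.0.0.1 200
    }
}
```

As the issue description notes, the field order in the configured POJO schema has to line up with the groups the pattern captures from each line.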
[jira] [Updated] (APEXMALHAR-2218) RegexParser- Operator to parse byte stream using Regex pattern and emit a POJO
[ https://issues.apache.org/jira/browse/APEXMALHAR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venkatesh Kottapalli updated APEXMALHAR-2218:
---------------------------------------------
    Fix Version/s: 3.7.0
       Issue Type: New Feature  (was: Improvement)

> RegexParser- Operator to parse byte stream using Regex pattern and emit a POJO
> ------------------------------------------------------------------------------
>
>          Key: APEXMALHAR-2218
>          URL: https://issues.apache.org/jira/browse/APEXMALHAR-2218
>      Project: Apache Apex Malhar
>   Issue Type: New Feature
>     Reporter: Venkatesh Kottapalli
>     Assignee: Venkatesh Kottapalli
>      Fix For: 3.7.0
>
> A generic parser which takes a regex pattern as a parameter and parses the
> log file line by line.
> Input: byte array
> Output: POJO
> Parameters: a POJO schema configuration is required, and the order of the
> schema should match the lines from the log file.
[jira] [Commented] (APEXMALHAR-2218) RegexParser- Operator to parse byte stream using Regex pattern and emit a POJO
[ https://issues.apache.org/jira/browse/APEXMALHAR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885445#comment-15885445 ]

ASF GitHub Bot commented on APEXMALHAR-2218:
--------------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/apex-malhar/pull/396

> RegexParser- Operator to parse byte stream using Regex pattern and emit a POJO
> ------------------------------------------------------------------------------
>
>          Key: APEXMALHAR-2218
>          URL: https://issues.apache.org/jira/browse/APEXMALHAR-2218
>      Project: Apache Apex Malhar
>   Issue Type: Improvement
>     Reporter: Venkatesh Kottapalli
>     Assignee: Venkatesh Kottapalli
>
> A generic parser which takes a regex pattern as a parameter and parses the
> log file line by line.
> Input: byte array
> Output: POJO
> Parameters: a POJO schema configuration is required, and the order of the
> schema should match the lines from the log file.
[GitHub] apex-malhar pull request #396: APEXMALHAR-2218-Creation of RegexSplitter ope...
Github user asfgit closed the pull request at:

    https://github.com/apache/apex-malhar/pull/396
[jira] [Commented] (APEXMALHAR-2427) Kinesis Input Operator documentation
[ https://issues.apache.org/jira/browse/APEXMALHAR-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885440#comment-15885440 ]

ASF GitHub Bot commented on APEXMALHAR-2427:
--------------------------------------------

GitHub user deepak-narkhede opened a pull request:

    https://github.com/apache/apex-malhar/pull/564

    APEXMALHAR-2427 Kinesis Input Operator documentation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/deepak-narkhede/apex-malhar APEXMALHAR-2427

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/564.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #564

commit a12bd8794ef2886e87db834136df365b5c08a1d4
Author: deepak-narkhede
Date:   2017-02-27T09:21:57Z

    APEXMALHAR-2427 Kinesis Input Operator documentation.

> Kinesis Input Operator documentation
> ------------------------------------
>
>          Key: APEXMALHAR-2427
>          URL: https://issues.apache.org/jira/browse/APEXMALHAR-2427
>      Project: Apache Apex Malhar
>   Issue Type: Documentation
>     Reporter: Deepak Narkhede
>     Assignee: Deepak Narkhede