[GitHub] apex-malhar pull request #545: APEXMALHAR-2376 Add Common Log support in Log...
GitHub user akshay-harale opened a pull request:

    https://github.com/apache/apex-malhar/pull/545

    APEXMALHAR-2376 Add Common Log support in LogParser operator

    https://issues.apache.org/jira/browse/APEXMALHAR-2376

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/akshay-harale/apex-malhar APEXMALHAR-2376-COMMON_LOG

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/545.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #545

commit e5a6fd35ded1560755dbdf2c8363ea4629458c62
Author: akshay
Date:   2017-01-31T07:06:38Z

    APEXMALHAR-2376 Add Common Log support in LogParser operator

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
Re: Schema Discovery Support in Apex Applications
The consumer of the output port operator schema is the next downstream operator.

On Tue, Jan 31, 2017 at 4:01 AM, Sergey Golovko wrote:
> Sorry, I'm a new person in the APEX team. And I don't clearly understand who the consumers of the output port operator schema(s) are.
>
> 1. If the consumers are non-run-time callers like the application manager or UI designer, maybe it makes sense to use Java static method(s) to retrieve the output port operator schema(s). I guess the performance of a single call of a static method via reflection can be ignored.
>
> 2. If the consumer is the next downstream operator, maybe it makes sense to send an output port operator schema from the upstream operator to the next downstream operator via the stream. The corresponding methods that would send and receive the schema should be declared in the interface/abstract-class of the upstream and downstream operators. The sending/receiving of an output schema should be processed right before the sending of the first data record via the stream.
>
> One example of a typical implementation for sending metadata with a regular result set is the sending of JDBC metadata as part of a JDBC result set. And I hope the output schema (metadata of the streamed data) in the implementation should contain not only a signature of the streamed objects (like field names and data types), but also any other properties of the data that can be useful to the schema receiver to process the data (for instance, a delimiter for a CSV record stream).
>
> Thanks,
> Sergey
>
> On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar wrote:
> > Thank you all for the feedback.
> >
> > I've created a Jira for this: APEXCORE-623 and I'll attach the same document and link to this mail chain there.
> >
> > As a first part of this Jira, there are 2 steps I would like to propose:
> > 1. Add the following interface at com.datatorrent.common.util.SchemaAware.
> > > > interface SchemaAware { > > > > Map registerSchema(Map > inputSchema); > > } > > > > This interface can be implemented by Operators to communicate its output > > schema(s) to engine. > > Input to this schema will be schema at its input port. > > > > 2. After LogicalPlan is created call SchemaAware method from upstream to > > downstream operator in the DAG to propagate the Schema. > > > > Once this is done, changes can be done in Malhar for the operators in > > question. > > > > Please share your opinion on this approach. > > > > Thanks, > > Chinmay. > > > > > > > > > > On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale > wrote: > > > > > +1 to have this feature. > > > > > > -Priyanka > > > > > > On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni < > pra...@datatorrent.com> > > > wrote: > > > > > > > +1 > > > > > > > > On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar < > chin...@apache.org> > > > > wrote: > > > > > > > > > Hi All, > > > > > > > > > > Currently a DAG that is generated by user, if contains any POJOfied > > > > > operators, TUPLE_CLASS attribute needs to be set on each and every > port > > > > > which receives or sends a POJO. > > > > > > > > > > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup -> > > > > > Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set by > user > > > on > > > > > both input and output ports of transform, dedup operators and also > on > > > > > parser output and formatter input. > > > > > > > > > > The proposal here is to reduce work that is required by user to > > > configure > > > > > the DAG. Technically speaking if an operators knows input schema > and > > > > > processing properties, it can determine output schema and convey > it to > > > > > downstream operators. This way the complete pipeline can be > configured > > > > > without user setting TUPLE_CLASS or even creating POJOs and adding > them > > > > to > > > > > classpath. 
> > > > > > > > > > On the same idea, I want to propose an approach where the pipeline > can > > > be > > > > > configured without user setting TUPLE_CLASS or even creating POJOs > and > > > > > adding them to classpath. > > > > > Here is the document which at a high level explains the idea and a > high > > > > > level design: > > > > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_ > > > > > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing > > > > > > > > > > I would like to get opinion from community about feasibility and > > > > > applications of this proposal. > > > > > Once we get some consensus we can discuss the design in details. > > > > > > > > > > Thanks, > > > > > Chinmay. > > > > > > > > > > > > > > >
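The SchemaAware interface quoted above lost its generic type parameters in the HTML archiving, so its exact signature is not recoverable from this thread. Below is a minimal, hypothetical sketch of what such a contract could look like, assuming schemas are keyed by port name and that a Schema carries at least a tuple class name; the class, port, and field names here are illustrative assumptions, not the actual Apex engine API:

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaAwareSketch {

  // Minimal stand-in for an engine-side schema descriptor (assumption).
  static class Schema {
    final String tupleClassName;
    Schema(String tupleClassName) { this.tupleClassName = tupleClassName; }
  }

  // The proposed contract: given the schemas arriving at the input ports,
  // an operator reports the schemas it will emit on its output ports.
  interface SchemaAware {
    Map<String, Schema> registerSchema(Map<String, Schema> inputSchema);
  }

  // Example operator whose output tuple type equals its input tuple type,
  // roughly how a Dedup or Filter operator would behave.
  static class DedupOperator implements SchemaAware {
    @Override
    public Map<String, Schema> registerSchema(Map<String, Schema> inputSchema) {
      Map<String, Schema> output = new HashMap<>();
      output.put("output", inputSchema.get("input"));
      return output;
    }
  }

  public static void main(String[] args) {
    Map<String, Schema> in = new HashMap<>();
    in.put("input", new Schema("com.example.LogLine"));
    Map<String, Schema> out = new DedupOperator().registerSchema(in);
    // The engine could now hand out.get("output") to the next operator
    // instead of requiring the user to set TUPLE_CLASS on every port.
    System.out.println(out.get("output").tupleClassName);
  }
}
```

With such a contract, only source operators (e.g. a Parser) would need an explicitly configured schema; everything downstream could be derived.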
[GitHub] apex-malhar pull request #544: APEXMALHAR-2397 #resolve Removing DAG.GATEWAY...
GitHub user sashadt opened a pull request:

    https://github.com/apache/apex-malhar/pull/544

    APEXMALHAR-2397 #resolve Removing DAG.GATEWAY_CONNECT_ADDRESS which is causing evaluation failures during apex get-app-package-info call

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sashadt/apex-malhar PiDemoAppData-DAG-null.APEXMALHAR-2397

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/544.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #544

commit b61d4fc21f2cec23be0643a11e1f1533d65fa5e5
Author: sashadt
Date:   2017-01-31T02:48:18Z

    APEXMALHAR-2397 #resolve Removing DAG.GATEWAY_CONNECT_ADDRESS which is causing evaluation failures during apex get-app-package-info call

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[GitHub] apex-core pull request #461: APEXCORE-504 - Possible race condition in Strea...
GitHub user vrozov opened a pull request:

    https://github.com/apache/apex-core/pull/461

    APEXCORE-504 - Possible race condition in StreamingContainerAgent.getStreamCodec()

@PramodSSImmaneni or @tweise Please review

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vrozov/apex-core APEXCORE-504

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/461.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #461

commit 29ca3ef1966b1ca2071136dd57ce860f05dfcf21
Author: Vlad Rozov
Date:   2017-01-31T01:24:45Z

    APEXCORE-504 - Possible race condition in StreamingContainerAgent.getStreamCodec()

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
Re: Schema Discovery Support in Apex Applications
Sorry, I'm a new person in the APEX team. And I don't clearly understand who the consumers of the output port operator schema(s) are.

1. If the consumers are non-run-time callers like the application manager or UI designer, maybe it makes sense to use Java static method(s) to retrieve the output port operator schema(s). I guess the performance of a single call of a static method via reflection can be ignored.

2. If the consumer is the next downstream operator, maybe it makes sense to send an output port operator schema from the upstream operator to the next downstream operator via the stream. The corresponding methods that would send and receive the schema should be declared in the interface/abstract-class of the upstream and downstream operators. The sending/receiving of an output schema should be processed right before the sending of the first data record via the stream.

One example of a typical implementation for sending metadata with a regular result set is the sending of JDBC metadata as part of a JDBC result set. And I hope the output schema (metadata of the streamed data) in the implementation should contain not only a signature of the streamed objects (like field names and data types), but also any other properties of the data that can be useful to the schema receiver to process the data (for instance, a delimiter for a CSV record stream).

Thanks,
Sergey

On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar wrote:
> Thank you all for the feedback.
>
> I've created a Jira for this: APEXCORE-623 and I'll attach the same document and link to this mail chain there.
>
> As a first part of this Jira, there are 2 steps I would like to propose:
> 1. Add the following interface at com.datatorrent.common.util.SchemaAware.
>
> interface SchemaAware {
>     Map registerSchema(Map inputSchema);
> }
>
> This interface can be implemented by Operators to communicate their output schema(s) to the engine. Input to this method will be the schema at its input port.
>
> 2.
After LogicalPlan is created call SchemaAware method from upstream to > downstream operator in the DAG to propagate the Schema. > > Once this is done, changes can be done in Malhar for the operators in > question. > > Please share your opinion on this approach. > > Thanks, > Chinmay. > > > > > On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale wrote: > > > +1 to have this feature. > > > > -Priyanka > > > > On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni > > wrote: > > > > > +1 > > > > > > On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar > > > wrote: > > > > > > > Hi All, > > > > > > > > Currently a DAG that is generated by user, if contains any POJOfied > > > > operators, TUPLE_CLASS attribute needs to be set on each and every port > > > > which receives or sends a POJO. > > > > > > > > For e.g., if a DAG is like File -> Parser -> Transform -> Dedup -> > > > > Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set by user > > on > > > > both input and output ports of transform, dedup operators and also on > > > > parser output and formatter input. > > > > > > > > The proposal here is to reduce work that is required by user to > > configure > > > > the DAG. Technically speaking if an operators knows input schema and > > > > processing properties, it can determine output schema and convey it to > > > > downstream operators. This way the complete pipeline can be configured > > > > without user setting TUPLE_CLASS or even creating POJOs and adding them > > > to > > > > classpath. > > > > > > > > On the same idea, I want to propose an approach where the pipeline can > > be > > > > configured without user setting TUPLE_CLASS or even creating POJOs and > > > > adding them to classpath. 
> > > > Here is the document which at a high level explains the idea and a high > > > > level design: > > > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_ > > > > tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing > > > > > > > > I would like to get opinion from community about feasibility and > > > > applications of this proposal. > > > > Once we get some consensus we can discuss the design in details. > > > > > > > > Thanks, > > > > Chinmay. > > > > > > > > > >
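Step 2 of the proposal above, calling SchemaAware from upstream to downstream after the LogicalPlan is created, could be sketched as follows for a simple linear DAG. The class names, port names, and the string-based schema representation below are illustrative assumptions for the sketch, not actual engine APIs:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaPropagationSketch {

  // Assumed contract: schemas keyed by port name, in and out.
  interface SchemaAware {
    Map<String, String> registerSchema(Map<String, String> inputSchema);
  }

  // An operator that fixes its output schema regardless of input,
  // as a Parser emitting a known POJO would.
  static class Parser implements SchemaAware {
    public Map<String, String> registerSchema(Map<String, String> in) {
      return Collections.singletonMap("output", "com.example.LogLine");
    }
  }

  // An operator that leaves the tuple class unchanged (e.g. Dedup).
  static class PassThrough implements SchemaAware {
    public Map<String, String> registerSchema(Map<String, String> in) {
      return Collections.singletonMap("output",
          in.getOrDefault("input", "java.lang.Object"));
    }
  }

  public static void main(String[] args) {
    // A linear DAG: Parser -> Dedup-like -> Dedup-like.
    List<SchemaAware> dag =
        Arrays.asList(new Parser(), new PassThrough(), new PassThrough());

    Map<String, String> schema = new HashMap<>();
    for (SchemaAware op : dag) {
      // Each downstream "input" port receives the upstream "output" schema.
      Map<String, String> out = op.registerSchema(schema);
      schema = new HashMap<>();
      schema.put("input", out.get("output"));
    }
    // The schema arriving at the sink, derived without any TUPLE_CLASS settings.
    System.out.println(schema.get("input"));
  }
}
```

A real implementation would walk the DAG in topological order rather than a list, but the propagation step per operator would look much the same.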
[GitHub] apex-core pull request #446: APEXCORE-610 Avoid multiple calls to getBytes.
Github user asfgit closed the pull request at: https://github.com/apache/apex-core/pull/446 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Re: One Yarn with Multiple Apex Applications
Are you running on the sandbox (if so, what version?) or your own cluster? In either case, please check the following configuration item in capacity-scheduler.xml:

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>

Try increasing the value from 0.1 to 0.5, restart YARN and try launching multiple applications again.

Ram

On Mon, Jan 30, 2017 at 12:54 AM, Santhosh Kumari G <santhosh.kum...@qolsys.com> wrote:
> Hi,
>
> Can we launch more than one (multiple) apex engine in one node with multiple terminals and one yarn running. If yes, what is the process.
>
> I tried launching 2 apex apps with 2 apex engine's. First apex app is running without any issue using the port 8042 configured in yarn-default.xml Then I tried to launch 2nd app it is saying accepted but not running as 8042 port is already in use. When I killed the 1st app, 2nd app is getting launched automatically.
>
> So can we manage one yarn with multiple apex engine app's?.
>
> Thank you,
> Santhosh Kumari G.
Re: One Yarn with Multiple Apex Applications
Hi Santhosh,

We can definitely run multiple Apex applications on a single YARN instance. The behaviour in your case is most probably due to a shortage of the resources required by the second application. Once the first application was killed, the resources were released and the second application got all its required resources and started running.

Ajay

On Mon, 30 Jan 2017 at 8:47 PM, Santhosh Kumari G <santhosh.kum...@qolsys.com> wrote:
> Hi,
>
> Can we launch more than one (multiple) apex engine in one node with multiple terminals and one yarn running. If yes, what is the process.
>
> I tried launching 2 apex apps with 2 apex engine's. First apex app is running without any issue using the port 8042 configured in yarn-default.xml Then I tried to launch 2nd app it is saying accepted but not running as 8042 port is already in use. When I killed the 1st app, 2nd app is getting launched automatically.
>
> So can we manage one yarn with multiple apex engine app's?.
>
> Thank you,
> Santhosh Kumari G.
Re: One Yarn with Multiple Apex Applications
Hi Santhosh,

It seems that your YARN does not have enough resources available to allocate memory for 2 applications. When you kill the first application, the memory is reclaimed by YARN and then allocated to the second application.

You can try to give more memory to YARN if your system has enough RAM. You can add a property as follows in yarn-site.xml and restart the YARN services:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>

You can give the value in MB as per the availability of RAM on your machine.

-Chinmay.

On Mon, Jan 30, 2017 at 2:24 PM, Santhosh Kumari G <santhosh.kum...@qolsys.com> wrote:
> Hi,
>
> Can we launch more than one (multiple) apex engine in one node with multiple terminals and one yarn running. If yes, what is the process.
>
> I tried launching 2 apex apps with 2 apex engine's. First apex app is running without any issue using the port 8042 configured in yarn-default.xml Then I tried to launch 2nd app it is saying accepted but not running as 8042 port is already in use. When I killed the 1st app, 2nd app is getting launched automatically.
>
> So can we manage one yarn with multiple apex engine app's?.
>
> Thank you,
> Santhosh Kumari G.
Re: [DISCUSS] Policy for patches
You make some fair points; a contributor may not want to submit patches for all release branches, but the community can pick this up as a policy. In many cases, it might be as simple as the reviewer cherry-picking the fix onto the other branches. In cases where it is not trivial and the reviewer or somebody in the community cannot help out at that time, we can put it in the backlog till somebody picks it up, and possibly use JIRA to track this backlog.

On Sun, Jan 29, 2017 at 11:22 AM, Thomas Weise wrote:
> The problem with this discussion is that it assumes that a policy could be established to apply patches. Any form of contribution to the project is volunteer work, so this is a non-starter.
>
> For example, if someone contributes a patch, then there is no way to enforce contribution of the patch for multiple branches. A contributor may do it due to other interests (like the vendor having to support a customer on that code base); another contributor may have no such incentive. Likewise the committer reviewing the work cannot be forced to repeat the same for multiple branches.
>
> I would like to see vendor motivation cleanly separated from community concerns.
>
> Perhaps it makes sense to come up with recommendations or guidelines around this, though. For example, there is in general little incentive for the community to release from outdated branches or to release code with known issues such as CVEs. And there is a release process and vote to deal with it.
>
> For example, one could recommend to not maintain a minor release branch after there have been n (2?) more recent minor or major releases. Or that we don't want maintenance releases for minor releases that have security issues.
>
> Looking at the current situation, I think that 3.2 and 3.3 could be considered obsolete from the community perspective (which does not stop a vendor from adding patches and consuming them for their purposes).
>
> Soon (maybe when 3.6 is out?)
there should be little reason to maintain 3.4 (3.5 is backward compatible and users should be incentivised to move up).
>
> I also think that under the current contribution guidelines there is no need to remove branches (even when they are fully reflected in tags). See the apex-core repository.
>
> I do think however that it may be good to clean up the pre-ASF branches in apex-malhar.
>
> Thomas
>
> On Fri, Jan 27, 2017 at 11:04 AM, Vlad Rozov wrote:
> > I prefer to go with the second approach as well.
> >
> > My preference is to go not with a strict end-of-life policy, but by severity of an issue and the complexity of providing fixes for all subsequent releases. In case a contributor decides to fix a bug in an old release, she will need to provide the fix for many branches. It is unlikely that such work will be done without justification.
> >
> > I am strongly against deleting old branches:
> > - They preserve history.
> > - I am not 100% sure, but it is likely against ASF policy. Any contribution to a project needs to be preserved (including the author of a commit).
> > - It does not cost much to have branches in a remote git repository and it does not affect git operations.
> > - It is not necessary to load all branches into a local repository.
> >
> > Thank you,
> >
> > Vlad
> >
> > On 1/27/17 10:16, Sanjay Pujare wrote:
> >> A strong +1 for the second approach for the reasons Pramod mentioned.
> >>
> >> Is it also possible to "prune" branches so that we have less of this activity of merging fixes across branches? If we can ascertain that a certain branch is not used by any user/customer (by asking in the community) we should be able to remove it. For example, apex-malhar has release-3.6 which is definitely required, but 3-year-old branches like release-0.8.5, release-0.9.0, … telecom most probably are not being used by anybody.
> >> On 1/27/17, 8:43 AM, "Pramod Immaneni" wrote:
> >>
> >> Hi,
> >>
> >> I wanted to bring up the topic of patches for issues discovered in older releases and start a discussion to come up with a policy on how to apply them.
> >>
> >> One approach is that the patch gets applied only to the release it was discovered in and master. Another approach is that it gets applied to all release branches >= the discovered release and master. There may be other approaches as well, which can come up in this discussion.
> >>
> >> The advantage of the first approach is that the immediate work is limited to a single fix and merge. The second approach requires more work initially, as the patch needs to get applied to one or more additional places.
> >>
> >> I am tending towards the second approach of applying the fix to all release branches >= the discovered release, while also having some sort of an
One Yarn with Multiple Apex Applications
Hi,

Can we launch more than one (multiple) Apex engine on one node, with multiple terminals and one YARN running? If yes, what is the process?

I tried launching 2 Apex apps with 2 Apex engines. The first Apex app runs without any issue using the port 8042 configured in yarn-default.xml. Then I tried to launch the 2nd app; it says accepted but not running, as port 8042 is already in use. When I killed the 1st app, the 2nd app got launched automatically.

So can we manage one YARN with multiple Apex engine apps?

Thank you,
Santhosh Kumari G.
Re: APEXMALHAR-2261 Python Binding for HighLevel APIs
Hi Thomas,

I had looked at APEXMALHAR-2260 as well and it will also be part of this development. Though Apex provides a Python script operator, it is actually a very limited script implementation. Lambda functions or custom Python functions which may have to run as scripts in a Python operator can be serialised using CloudPickle and run on various nodes. I am still investigating how to ensure that all libraries required by the Python code are made available to operators running on different nodes. One of the approaches suggested by Cloudera is to make sure all libraries are available on each node of the cluster; this was suggested with respect to PySpark jobs. Please do suggest a better alternative for making the Python environment available as required, even in a cluster environment.

Thanks & Regards,
Vikram

On Sun, Jan 29, 2017 at 1:11 AM, Thomas Weise wrote:
> Hi,
>
> Python support would be great to have. Users look for the ability to use Python with its library ecosystem. How will that be possible with this API proposal?
>
> I suspect that just being able to wire operators in Python is of limited impact when operators cannot execute Python. Have you looked at APEXMALHAR-2260 as well?
>
> Thanks
>
> On Fri, Jan 27, 2017 at 11:39 PM, vikram patil wrote:
> > Hi All,
> >
> > I would like to take up development of the Python binding implementation for the high-level APIs (APEXMALHAR-2261). I went over the High-Level APIs from the Apache Malhar Stream API project. It can be initiated as a separate project in the Apache Malhar project, just like the sql or stream projects.
> >
> > In the first phase I would like to focus on providing Python bindings for the following APIs:
> >
> > 1) StreamFactory.fromFolder
> > 2) StreamFactory.fromKafka*
> > 3) StreamFactory.fromLocal
> > 4) StreamFactory.fromInput
> > 5) ApexStream.map
> > 6) ApexStream.flatMap
> > 7) ApexStream.filter
> > 9) ApexStream.endWith
> > 11) ApexStream.setGlobalAttribute
> > 12) Custom functions in Python.
> >
> > The rest of the Apex high-level APIs, such as addStream and addOperator, can be implemented as part of phase II.
> >
> > For the integration, I would like to use py4j as the Python-Java binding due to its wide acceptance and very good community support. py4j also allows callbacks to Python code from Java, which can make certain functionalities easier to implement.
> >
> > Py4j version: 0.10.4
> >
> > Please share your suggestions about this implementation.
> >
> > Thanks & Regards,
> > Vikram