Re: [Feature Proposal] Move Apache Apex website to Jekyll

2018-06-29 Thread Ananth G
Thanks for the feedback Thomas and Tushar.

@Thomas: I am wondering whether the javadocs (if that is what you meant by
documentation being a release artifact) can be hosted on an external
system? (For example, Flink seems to simply point to ci.apache.org.) I will
cut down the top-level "Ecosystem" menu. Is there anything else that you
see as crowded? Also, could you please point out the considerations for the
ASF theme?

I have only seen two votes so far, and hence I am unsure whether to take
this forward.

Regards,
Ananth


On Sun, Jun 24, 2018 at 5:16 AM, Thomas Weise  wrote:

> +1 for website revamp with Jekyll.
>
> Regarding the details:
>
> Documentation is a release artifact that needs to remain decoupled from
> other web site content. I would propose to keep the documentation
> build/release process out of this effort.
>
> The top level navigation looks a bit crowded. I would suggest to reduce the
> number of items. Also, we should discuss what should appear there, to fit
> the overall ASF theme.
>
> Thanks
>
> On Thu, Jun 21, 2018, 4:33 AM Ananth G  wrote:
>
> > [...]
>


[Feature Proposal] Move Apache Apex website to Jekyll

2018-06-20 Thread Ananth G
Hello All,

Per some JIRA tickets and other PR review comments, a single cohesive
website with blogging capabilities seems long overdue for Apache Apex.

I would like to propose moving the Apache Apex website to Jekyll to address
the following high-level concerns:

- Provide a better build experience for the website
- Make it easier to contribute blogs
- Make the approval process easier by relying more on markup
- Organise the content into a cleaner layout so that it is easier to
maintain.


Here is a "skeleton" preview of the Jekyll-based site, compatible with
GitHub Pages:
- https://apacheapex.github.io/index.html
- Each page can have a customised sidebar, like
https://apacheapex.github.io/architecture.html
and https://apacheapex.github.io/writing_to_kudu_using_apex.html
- There is a menu for blogs giving latest and archived views of blog entries:
https://apacheapex.github.io/latestblogentries.html
- The site is mobile compatible as well.
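
For illustration, per-page sidebars in this kind of Jekyll theme are driven
by front matter at the top of each page. A minimal sketch only; the field
values below are placeholders, not the final theme configuration:

    ---
    title: "Apache Apex Architecture"
    sidebar: architecture_sidebar   # per-page sidebar; name is hypothetical
    permalink: architecture.html
    ---
    Page content in Markdown goes here.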


Here are some call-outs regarding the implementation:

- The above URL is only a temporary host to showcase the skeleton structure
and will be migrated to the apex-site URL as a separate branch once I get
some consensus.
- There is a BSD-style licensed component that I need guidance on:
https://github.com/apacheapex/apacheapex.github.io/blob/master/LICENSE-BSD-NAVGOCO.txt
- We will need more contributions from someone who has Bootstrap experience
to fix a few aspects of the home page (section-powered scrolling and the
top-level Apache Apex header sections).
- The effort is going to span a few months, as it is an entire website
migration and there is a lot of content that needs to be collated, and
possibly updated, as we progress.
- The site is based on the following project:
https://github.com/tomjoht/documentation-theme-jekyll


My ask as part of the approval process:

- A high-level approval of the approach
- Agreement on the high-level menu and the drop-downs at the top (comments
welcome)
- Usage of Jekyll as the main mechanism to generate content
- Commenting on blogs is consciously disabled
- Inclusion of the BSD-style license for the NAVGOCO sidebar component
- Migration of the content from multiple locations (Malhar documentation,
Apex documentation) into a single cohesive set
- Altering of the build process for the website - I do not yet know how
compiled content is pushed from Jenkins to the apache.org domain


May I also ask whether anyone else is interested in contributing to this
effort? I will wait for comments for the next 5 days and create JIRA
tickets accordingly.

Regards,
Ananth


Re: Proposal for UI component in stram

2018-06-10 Thread Ananth G
+1 for a UI for Apex.

-0 for making it part of the STRAM.

- Regarding the same port: I was wondering whether there will be usage
patterns based on firewall rules keyed purely on ports? This leads us to a
situation where we have to trust all UI users at the same level as the
application communication patterns.
- Deepak N mentioned building metrics components. Assuming he is working on
that feature and is providing a REST API as well, there might be overlaps
in the functionality being implemented. Some of the UI features mentioned
in the list seem to be definitely metrics related. Most of the metrics UI
views can be aggregated in external web applications like Grafana, with the
help of other tooling such as the REST server from atrato or Prometheus
components.
- Are we considering serving metrics beyond application restarts? If yes,
visualising metrics is a better fit for tooling carved out for that
purpose, given that the amount of information will be huge and we will
potentially also need time-series stores.
- The highest value is in the points mentioned in the second phase, and the
DAG views implementation in the first phase (assuming Deepak N's proposal
for metrics implements a REST API).

Providing a UI is definitely a huge value add and the need of the hour for
Apex. My concern is that we might dilute that value by building such
functionality inside the STRAM; I feel it is better served as an external
application rather than something embedded in the STRAM.
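
For what it's worth, option (b) in Chinmay's proposal quoted below (serving
the UI from the existing web service port under a different path such as
/ui) could look roughly like the following standalone Jetty sketch. This
only illustrates the path-based approach; the port number and resource
location are made-up values, and stram's actual web service bootstrap
would differ:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.servlet.DefaultServlet;
    import org.eclipse.jetty.servlet.ServletContextHandler;
    import org.eclipse.jetty.servlet.ServletHolder;

    public class UiPathSketch {
        public static void main(String[] args) throws Exception {
            // In stram this would be the already-running web service port.
            Server server = new Server(8088);
            ServletContextHandler context =
                new ServletContextHandler(ServletContextHandler.NO_SESSIONS);
            context.setContextPath("/");
            // Serve static UI assets under /ui/* on the same port.
            ServletHolder ui = new ServletHolder("ui", DefaultServlet.class);
            ui.setInitParameter("resourceBase", "webapp/ui"); // hypothetical location
            ui.setInitParameter("pathInfoOnly", "true");
            context.addServlet(ui, "/ui/*");
            server.setHandler(context);
            server.start();
            server.join();
        }
    }

Keeping the UI on the same port would also keep it reachable through the
YARN RM proxy, as Thomas notes below.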

Regards,
Ananth


On Sun, Jun 10, 2018 at 3:10 AM, Thomas Weise  wrote:

> strong +1 for adding a UI to Apex. This is actually something that users
> expect and are accustomed to with similar projects. Thanks for taking the
> initiative.
>
> There are nuances to be discussed, but the overall plan LGTM.
>
> To your question, I think this should use the existing web port with a
> different path, this will make it easy to expose through the RM proxy. The
> question whether the port is static or dynamic is a different concern,
> there could be a separate feature that optionally allows the user to assign
> a static port. On YARN, there is already a static address, the RM proxy
> port.
>
> I would propose to start this feature on a branch. The requirements for
> build integration and source code structure should be fleshed out as you
> go. It is conceivable that the UI build, which requires separate tooling,
> runs as part of the release build, but not the usual CI.
>
> Thomas
>
>
>
> On Wed, Jun 6, 2018 at 7:43 AM, Chinmay Kolhatkar 
> wrote:
>
> > Hi Community,
> >
> > I want to propose a feature in apex-core to have a UI component in stram.
> > This is influenced by how basic runtime information is shown on UI by
> > spark.
> >
> > This includes following features:
> > 1. Webpage will be hosted by stram and will be available for user to view
> > in one of the two ways (I'll prefer approach b):
> >a. Hosted on a static port, in which case we'll need to add a new
> > servlet to stram
> >b. The webpage will be hosted on the same port as that of the stram
> > webservice but at a different path. Say, http://<host>:<port>/ui
> >
> > 2. The webpage will be a static page and, depending on the framework we
> > choose, it can show realtime metric data from stram.
> >
> > 3. There will be categories of read-only information (possibly
> > dynamically changing) that will be shown on the UI:
> >a. Application level information
> >i. Status
> >ii. Number of tuples processed/emitted
> >iii. Start time/ Running span
> >iv. Stram events
> >v. Total memory used/available
> >vi. Number of containers allocated/deployed
> >vii. Other information which is available in stram
> >b. Logical Operator level information
> >i. Status
> >ii. Number of tuples processed/emitted
> >iii. Important events related to logical operator
> >iv. Container list in which the operator is deployed
> >v. Any other information available in stram
> >vi. Logical DAG View
> >c. Container level information
> >i. Status
> >ii. Number of tuples processed/emitted
> >iii. Important events related to logical operator
> >iv. Any other information available in stram
> >v. Physical DAG View
> >
> > 4. In second phase, there will be control related operations allowed from
> > the UI, as follows:
> >a. Stop/Kill application
> >b. Stop/Kill containers
> >c. Stack trace dump of containers
> >d. Logs related?
> >e. etc...??
> >
> > The above implementation can be done in phases as follows:
> > 1. Have the webpage display application-level information only
> > (read-only). Static display first (i.e. the page needs a refresh to see
> > the latest data), then dynamically changing data.
> > 2. Have the Jenkins build tools updated as needed by the UI framework.
> > 3. Update the backend (stram) to serve the UI pages.
> > 4. Extend the 

Re: [ANNOUNCE] New Apache Apex PMC Member: Chinmay Kolhatkar

2018-05-24 Thread Ananth G
Congratulations Chinmay 

Regards
Ananth

> On 25 May 2018, at 4:19 am, amol kekre  wrote:
> 
> Congrats Chinmay
> 
> Amol
> 
> On Thu, May 24, 2018 at 10:34 AM, Hitesh Kapoor 
> wrote:
> 
>> Congratulations Chinmay!!
>> 
>> Regards,
>> Hitesh Kapoor
>> 
>>> On Thu 24 May, 2018, 11:02 PM Ilya Ganelin,  wrote:
>>> 
>>> Congrats!
>>> 
>>> On Thu, May 24, 2018, 10:09 AM Pramod Immaneni 
>> wrote:
>>> 
 Congratulations Chinmay.
 
> On Thu, May 24, 2018 at 9:39 AM Thomas Weise  wrote:
> 
> The Apache Apex PMC is pleased to announce that Chinmay Kolhatkar is now a
> PMC member.
> 
> Chinmay has contributed to Apex in many ways, including:
> 
> - Various transform operators in Malhar
> - SQL translation based on Calcite
> - Apache Bigtop integration
> - Docker sandbox
> - Blogs and conference presentations
> 
> We appreciate all his contributions to the project so far, and are looking
> forward to more.
> 
> Congrats!
> Thomas, for the Apache Apex PMC.
> 
 
>>> 
>> 


Re: [DISCUSS] Maven version

2018-05-16 Thread Ananth G
My understanding is that we might want to use the support for the various
config files:
- .mvn/jvm.config
- .mvn/maven.config

Version 3.5.0 allows a manual override of the maven.config values from the
command line if the local development process needs it. Hence +1 for 3.5.0.

Plus, 3.5.x seems to provide additional goodies such as coloured output as
well as a module progress indicator.
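
As a concrete illustration, these are plain-text files under the project's
.mvn directory; the values below are purely illustrative, not a proposed
setting for Apex:

    # .mvn/maven.config -- default command-line options, one per line
    -T1C
    -Dmaven.test.skip=false

    # .mvn/jvm.config -- JVM options for the Maven process itself
    -Xmx2g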

Regards,
Ananth


On Thu, May 17, 2018 at 5:29 AM, Yogi Devendra <yogideven...@apache.org>
wrote:

> 1. +1 for 3.3.9.
> 2. Why do we need 3.5.x if 3.3.9 serves the purpose?
>
> ~ Yogi
>
> On 16 May 2018 at 20:20, Ananth G <ananthg.a...@gmail.com> wrote:
>
> > [...]
>


Re: [DISCUSS] Maven version

2018-05-16 Thread Ananth G
+1 for maven version upgrade.

+1 for version 3.5.x, as there seem to be fixes in it for issues related to
this property in version 3.3.9, e.g.
https://issues.apache.org/jira/browse/MNG-5786



Regards,
Ananth

On Wed, May 16, 2018 at 5:13 AM, Vlad Rozov  wrote:

> I think it is time to revisit minimum maven version required to build
> Apex. I suggest upgrading to 3.3.1 at minimum to get
> maven.multiModuleProjectDirectory property supported. Another option is
> to upgrade to 3.5.x.
>
> Thank you,
>
> Vlad
>


Re: [Feature Proposal] Add metrics dropwizards like gauges, meters, histogram etc to Apex Platform

2018-05-09 Thread Ananth G
+1 for the feature.

+1 for an abstracted way of metrics library integration.

Some additional thoughts on this:

- We might have implications of adding a set of dependencies from drop
wizard into the engine core ( guava versions etc).
- Flink as pointed out by Thomas seems to have taken an interesting
approach to metrics ( apart from supporting multiple reporters ) - Any
reporter is only enabled if the relevant jar is put in the classpath.
- There seems to be two worlds for generating the metric names. 1. Dot
notation separated ( or some separator thereof. Ex: dropwizard style) and
2. high dimensional metric names using key value pair tags.  ( prometheus
style). We might have to "transform" the dot hierarchical notation to a key
value pair based notation based on the reporter that is chosen by the end
user.
- The scope of the implementation is too big as I understand it and perhaps
this feature needs to be done in multiple JIRAs. Also malhar is still at
java-7 while engine is at java-8.
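
To make the naming point concrete, a minimal sketch using the Dropwizard
Metrics API (the application and operator names are made up for
illustration):

    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;

    public class MetricNamingSketch {
        public static void main(String[] args) {
            // Dropwizard style: hierarchical, dot-separated names.
            MetricRegistry registry = new MetricRegistry();
            Meter processed = registry.meter(
                MetricRegistry.name("appName", "op1", "tuplesProcessed"));
            processed.mark();
            // Prints [appName.op1.tuplesProcessed]
            System.out.println(registry.getMeters().keySet());

            // Prometheus style would instead want a flat name plus tags, e.g.
            //   tuples_processed_total{app="appName", operator="op1"}
            // so a reporter-specific transform of the dot hierarchy is needed.
        }
    }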

Some questions:

- Is there a plan to expose the metrics via a Jetty endpoint on each JVM
(if configured)?
- How do we plan to handle dynamic partitioning of operators for metrics
(i.e. possibly short-lived JVMs)?
- Does this feature imply that we will no longer support an equivalent of
the complex types of AutoMetrics?
- Does this feature also imply that we will no longer support AutoMetric
aggregators (leaving this to the metrics tooling)?
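
On the reporting side (the console/JMX use cases in Deepak's list below),
the Dropwizard wiring would look roughly like this; a sketch only, not the
proposed Apex integration:

    import com.codahale.metrics.ConsoleReporter;
    import com.codahale.metrics.MetricRegistry;
    import java.util.concurrent.TimeUnit;

    public class ReporterSketch {
        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();
            registry.counter("containersAllocated").inc();
            registry.meter("tuplesProcessed").mark(42);

            // One-off dump to the console; start(1, TimeUnit.SECONDS)
            // would instead report periodically.
            ConsoleReporter console = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
            console.report();
            // A JMX or Graphite reporter is wired the same way via
            // its own forRegistry(...) builder.
        }
    }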


Regards,
Ananth

On Thu, May 10, 2018 at 12:37 PM, Chinmay Kolhatkar
<chinmaykolhatka...@gmail.com> wrote:

> +1 for approach 1. As Thomas mentioned, I think the metrics layer should
> be abstracted out from its implementation. This way one can plug in
> different metrics systems in Apex.
>
> Also keep in mind that there is a lot of code which uses the AutoMetrics
> annotations. We should enable a smooth transition from that. As the next
> release will be a major version release, this is a good opportunity for
> getting rid of the old AutoMetrics and counters functionality.
>
> Regards,
> Chinmay.
>
>
> On Thu, 10 May 2018, 4:47 am Vlad Rozov,  wrote:
>
> > +1 for the #1 proposal.
> >
> > Thank you,
> >
> > Vlad
> >
> > On 5/9/18 07:14, Thomas Weise wrote:
> > > +1 for the initiative
> > >
> > > Some thoughts:
> > >
> > > - it is probably good to retain a level of abstraction, to avoid direct
> > > dependency on dropwizard
> > > - support programmatic metric creation (not just annotations)
> > > - remove deprecated counter and auto-metric code and migrate operators
> to
> > > use new API
> > > - which metric reporting systems will be supported out of the box
> > >
> > > You can also take a look at how this was structured in Flink:
> > >
> > > https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Tue, May 8, 2018 at 8:56 AM, Deepak Narkhede
> > > <mailtodeep...@gmail.com> wrote:
> > >
> > >> Hi Community,
> > >>
> > >> I want to propose the addition of metrics like gauges, meters,
> > >> counters and histograms for the following components:
> > >> 1) Addition of metrics for container stats.
> > >> 2) Addition of metrics for operator stats.
> > >> 3) Addition of metrics for Stram (application master) stats.
> > >> 4) Addition of metrics for JVM-related stats for all containers.
> > >>
> > >> To implement them, we would be using the Dropwizard Metrics APIs
> > >> (http://metrics.dropwizard.io/).
> > >> Use cases:
> > >> 1) Can be pushed directly to an external visualisation system like
> > >> Graphite.
> > >> 2) Can be viewed in tools like VisualVM through JMX.
> > >> 3) Can be output to the console.
> > >> 4) It is also possible to push the metrics to a custom sink.
> > >>
> > >> We will also need to write sinks and reporters, if required, for
> > >> custom sinks.
> > >>
> > >> Design/Implementation approach:
> > >> Way #1:
> > >> 1) Create new annotations like @MetricTypeGauge, @MetricTypeMeter,
> > >> @MetricTypeCounter, @MetricTypeHistogram. They can apply to both
> > >> fields and methods.
> > >> 2) Add them to the relevant methods or fields in StreamingContainer,
> > >> StreamingAppMasterService etc. for extraction of the relevant metrics.
> > >> 3) During Node creation (InputNode/GenericNode/OiONode), create and
> > >> initialise the metrics registry depending on the component.
> > >> 4) During collectMetrics(), as part of the operator runner thread
> > >> (InputNode.run/GenericNode.run), invoke the annotated methods and
> > >> collect the different types of metrics.
> > >> 5) Have a sink which pushes the metrics to a reporter like Console,
> > >> JMX etc.
> > >>
> > >> Way #2:
> > >> Use the existing AutoMetrics annotations and convert some metrics to
> > >> the different types like gauge, counter etc. But this cannot be done
> > >> generically, as we don't know the types. More investigation is going
> > >> on into this approach.
> > >>
> > >> I would prefer the first way.
> > >>
> > >> Note: There are 

Re: Apex-core build/release steps improvements proposal

2018-05-03 Thread Ananth G
+1 to all 3, considering we are trying to centralise the code.

Item 2 should eventually be redone as part of
https://issues.apache.org/jira/browse/APEXCORE-796, but the design for this
needs to be seen in the broader context of some of the points mentioned
below.

Regarding 3, I agree that the current image is tightly coupled to Bigtop.
While making it independent of Bigtop is a starting step, I believe we
might need to revisit our thinking on how we would like to implement
containerisation for Apex in the first place.


There are multiple design items to be resolved for Apex containerisation:

1. The Apex community needs to evaluate both Hadoop-based and Hadoop-free
architectures. For non-Hadoop architectures, we need to solve for DFS
alternatives as well as resource manager alternatives. Tickets like
https://issues.apache.org/jira/browse/APEXCORE-724 will bring out this
design issue in more detail, I believe.

2. Consider how Apex applications will be built as part of a build process
that results in a Docker image of the Apex application (that would contain
application code, Malhar operators etc.).

3. Consider how we would like to make use of Hadoop 3's support for Docker:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DockerContainers.html



Just curious about the Docker implementation: is the end goal of the
Docker image to

1. provide a sandbox for evaluating Apex, or
2. make the Apex installable binary available as an image, or
3. align Apex applications with a Docker build process (e.g. Python
libraries installed on the image as part of the application code)?

The reason I raise these questions is that it does not make much sense to
bundle a cluster-in-a-box with any distribution (dockerising a Hadoop
cluster is non-trivial, and I have not heard good success stories around
this approach so far that could be enabled for production). A Docker image
that embeds a Hadoop binary is thus only useful for evaluation, where
everything is contained in the same image, and nothing more.

My suspicion is that we would revisit this approach anyway if our goals are
2 and/or 3. Perhaps we will address these questions as part of
https://issues.apache.org/jira/browse/APEXCORE-724 and
https://issues.apache.org/jira/browse/APEXCORE-796.

Regards,
Ananth

On Fri, May 4, 2018 at 10:31 AM, Vlad Rozov  wrote:

> +1 to all 3.
>
> Thank you,
>
> Vlad
>
>
> On 5/3/18 07:03, Thomas Weise wrote:
>
>> +1 to all of this
>>
>> There are existing JIRAs that you can assign / add to:
>>
>> https://issues.apache.org/jira/browse/APEXCORE-727
>>
>> Thanks!
>>
>>
>>
>> On Thu, May 3, 2018 at 4:26 AM, Chinmay Kolhatkar 
>> wrote:
>>
>>> Hello Community,
>>>
>>> I want to propose the following improvements for the apex-core build and
>>> related steps:
>>>
>>> 1. Most (probably all) open source projects have a binary release
>>> package of the software, not just a source release package. Currently
>>> we have only a source package. Luckily there are a few places (outside
>>> of Apache Apex) where binary packages of Apex have been created for
>>> different purposes: https://github.com/atrato/apex-cli-package and
>>> https://github.com/apache/bigtop.
>>>
>>> The proposal here is to generate this binary release package as a part
>>> of the build process of apex-core.
>>>
>>>
>>> 2. Currently, the Docker image that is being created for Apex is built
>>> in one of my personal repositories
>>> (https://github.com/chinmaykolhatkar/docker-pool).
>>> While I don't mind hosting the content (Dockerfile etc.) in my
>>> repository, I believe it makes sense to host this in the apex-core
>>> repository. This way, there is a possibility of using Docker GitHub
>>> triggers for building the Docker image from release branches.
>>>
>>>
>>> 3. Currently the Docker build uses Hadoop- and Apex-specific packages
>>> from the Bigtop deb repo and CI (see
>>> https://github.com/chinmaykolhatkar/docker-pool/blob/master/apex/ubuntu/app/setup.sh
>>> for more details).
>>> While the use of Hadoop packages from the Bigtop repo is fine, we also
>>> need to rely on a Bigtop contribution to update the Apex component and
>>> then build from the Bigtop CI to get the apex.deb package. Basically,
>>> our Docker image generation process gets blocked on a Bigtop source
>>> update to generate the updated Apex deb.
>>> As we technically don't need to depend on Bigtop to generate the Apex
>>> binary, the proposal here is to generate the binary package during the
>>> build process (point 1) and use that during the Docker image build
>>> process instead of using the ready-made deb package from the Bigtop CI.
>>>
>>>
>>> I understand that there are multiple items being mentioned in a single
>>> mail, but they seem related, hence the single mail.
>>>
>>> Please let me know your opinion on above items.
>>>
>>> Thanks,
>>> Chinmay.
>>>
>>>
>


[Proposal] Switch to Java 8

2018-01-18 Thread Ananth G
Hello All,

I was wondering if we can consider making Java 8 the minimum supported JDK
version going forward.

Since Apex core is moving to the next release cycle, perhaps we can
consider this now? My assumption is that the next release version for core
after the 3.7.0 release is going to be a major version bump.

My understanding is that core needs to move to JDK 8 before we can move
library/malhar to JDK 8.

Here are some points to consider (there might be more points to add to the
list below):

- Java 7 was deprecated from a support standpoint more than 2 years ago
- Peer streaming frameworks like Flink are mandating Java 8
- Apache Beam is moving to Java 8 from its next release cycle
- Many Hadoop vendors provide Java 8 as the base for the platform
- It will help the community to use Java 8 streams and lambdas in operators
and core
- There are certain Maven dependencies which might be more optimal if the
Java 8 version of the dependency is inherited as opposed to the Java 7
version
- JUnit 5 might enable richer unit tests

Regards,
Ananth


Re: Core release 3.7.0

2018-01-18 Thread Ananth G
+1 for the release.

Regards,
Ananth

On Fri, Jan 19, 2018 at 4:41 AM, Chinmay Kolhatkar 
wrote:

> +1 for the release.
>
> - Chinmay.
>
> On 18 Jan 2018 8:58 pm, "Tushar Gosavi"  wrote:
>
> > +1 for the release
> >
> > Regards,
> > - Tushar.
> >
> >
> > On Thu, Jan 18, 2018 at 3:28 AM, Amol Kekre 
> wrote:
> >
> > > +1
> > >
> > > Thks,
> > > Amol
> > >
> > >
> > > E:a...@datatorrent.com | M: 510-449-2606 | Twitter: @*amolhkekre*
> > >
> > > www.datatorrent.com
> > >
> > >
> > > > On Wed, Jan 17, 2018 at 1:55 PM, Pramod Immaneni
> > > > <pra...@datatorrent.com> wrote:
> > >
> > > > +1
> > > >
> > > > > On Jan 17, 2018, at 7:25 AM, Thomas Weise  wrote:
> > > > >
> > > > > Last release was 3.6.0 in May and following issues are ready for
> > > release:
> > > > >
> > > > > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%203.7.0%20AND%20project%20%3D%20APEXCORE%20ORDER%20BY%20status%20ASC
> > > > >
> > > > > Any opinions on cutting a release?
> > > > >
> > > > > Any committer interested running the release?
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > >
> > > >
> > >
> >
>


Re: [Discuss] Design of the python execution operator

2017-12-22 Thread Ananth G
I guess my comment below regarding the overhead of serialisation in
container-local mode is wrong? Nevertheless, having a local thread
implementation gives some benefits. For example, I can choose whether to
sleep when there is no request in the queue, or to spin checking for
request presence in the request queue, etc., to avoid delays in the
request queue processing itself.

Regards,
Ananth

> On 23 Dec 2017, at 6:50 am, Ananth G <ananthg.a...@gmail.com> wrote:
> 
> 
> Thanks for the comments Thomas and Pramod. 
> 
> Apologies for the delayed response on this thread. 
> 
> > I believe the thread implementation still adds some value over a
> > container-local approach. It is more of a "thread local" equivalent,
> > which is more efficient than a container-local implementation. Also, the
> > number of worker threads is configurable; setting the value to 1 lets
> > the user opt out of this ( although I do not see a reason why ). There
> > is always the overhead of a serialise/de-serialise cycle even for a
> > container-local approach, and there is the additional possibility of
> > container-local not being honoured by the resource manager based on the
> > state of the resources.
> 
> > Regarding the configurable key to ensure all tuples in a window are
> > processed: I am adding a switch which lets the user choose ( and javadoc
> > that clearly points out the issues if not waiting for the tuples to be
> > completely processed ). There are pros and cons to this, and letting the
> > user decide might be a better approach. The reason why I mention cons
> > for waiting for the tuples to complete ( apart from the reason that
> > Thomas mentioned ) is that if one of the commands that the user wrote is
> > erroneous, all subsequent calls to that interpreter thread can fail. An
> > example use case: tuple A sets some value for variable x, and tuple B
> > that comes next makes use of variable x. Syntactically the expression
> > for tuple B is valid; it just depends on variable x. Now if variable x
> > is not in memory because tuple A is a straggler, tuple B ends up in an
> > erroneous interpreter state. Hence the operator might stall
> > indefinitely, as the end window will be stalled forever, ultimately
> > resulting in the operator being killed. This is also because the
> > erroneous command corrupted the state of the interpreter itself. Of
> > course, this can happen to all of the threads in the interpreter worker
> > pool as well. Perhaps an improvement over the current implementation is
> > to detect all such stalled interpreters ( stalled for more than x
> > windows ) and rebuild the interpreter thread when such a situation is
> > detected.
> 
> > Thanks for the IdleTimeHandler tip, as this helped me ensure that the
> > stragglers are drained irrespective of a new tuple coming in for
> > processing. In the previous iteration, the stragglers could only be
> > drained when a new tuple came in for processing, as the delayed
> > responses queue could only be checked when there was some activity on
> > the main thread.
> 
> > Thanks for raising the point about virtual environments: this is a
> > point I missed mentioning in the design description below. There is no
> > support for virtual environments yet in JEP, hence the current
> > limitation. However, the workaround is simple. As part of the
> > application configuration, we need to provide the JAVA_LIBRARY_PATH
> > which contains the path to the JEP dynamic libraries. If there are
> > multiple Python installs ( and hence multiple JEP libraries to choose
> > from for each of the Apex applications being deployed ), setting the
> > right path for the operator JVM will result in picking the
> > corresponding Python interpreter version. This also essentially means
> > that we cannot have a thread-local deployment configuration of two
> > Python operators that belong to different Python versions in the same
> > JVM. The Docker approach ticket
> > https://issues.apache.org/jira/browse/APEXCORE-796 should be the right
> > fix for the virtual environments issue ( but still might not solve the
> > thread-local configuration deployment ).
> 
> Regards,
> Ananth
> 
> 
>> On 21 Dec 2017, at 11:01 am, Pramod Immaneni <pra...@datatorrent.com> wrote:
>> 
>> On Wed, Dec 20, 2017 at 3:34 PM, Thomas Weise <t...@apache.org> wro

Re: [Discuss] Design of the python execution operator

2017-12-22 Thread Ananth G
ly by back pressure. In this case the
> operator is most likely going to be downstream in the DAG and would have
> constraints for processing guarantees. For scalability, container local
> could also be used as a substitute for multiple threads without resorting
> to using separate containers. I can understand the use of a separate
> thread to be able to get around problems like stalled processing, but I
> would first try to see if something like container local would work for
> scaling.
> 
> 
>> It is also correct that generally there is no ordering guarantee within a
>> streaming window, and that would be the case when multiple input ports are
>> present as well. (The platform cannot guarantee such ordering, this would
>> need to be done by the operator).
> 
> 
> 
>> Idempotency can be expensive (latency and/or space complexity), and not all
>> applications need it (like certain use cases that process record by record
>> and don't accumulate state). An example might be Python logic that is used
>> for scoring against a model that was built offline. Idempotency would
>> actually be rather difficult to implement, since the operator would need to
>> remember which tuples were emitted in a given interval and on replay block
>> until they are available (and also hold others that may be processed sooner
>> than in the original order). It may be easier to record emitted tuples to a
>> WAL instead of reprocessing.
>> 
> 
> Ordering cannot be guaranteed, but the operator would need to finish the
> work it is given in a window within the window boundary; otherwise there
> is a chance of data loss in recovery scenarios. You could make the
> checkpoint the boundary by which all pending work is completed instead of
> every window boundary, but then downstream operators cannot rely on
> window-level idempotency for exactly-once. Something like the file output
> operator would work, but not our DB kind of operators. Both options could
> be supported in the operator.
> 
> 
>> Regarding not emitting stragglers until the next input arrives, can this
>> not be accomplished using IdleTimeHandler?
>> 
>> What is preventing the use of virtual environments?
>> 
>> Thanks,
>> Thomas
>> 
>> 
>> On Tue, Dec 19, 2017 at 8:19 AM, Pramod Immaneni <pra...@datatorrent.com>
>> wrote:
>> 
>>> Hi Ananth,
>>> 
>>> From your explanation, it looks like the threads overall allow you to
>>> achieve two things. Have some sort of overall timeout if by which a tuple
>>> doesn't finish processing then it is flagged as such. Second, it doesn't
>>> block processing of subsequent tuples and you can still process them
>>> meeting the SLA. By checkpoint, however, I think you should try to have a
>>> resolution one way or the other for all the tuples received within the
>>> checkpoint period or every window boundary (see idempotency below),
>>> otherwise, there is a chance of data loss in case of operator restarts.
>> If
>>> a loss is acceptable for stragglers you could let straggler processing
>>> continue beyond checkpoint boundary and let them finish when they can.
>> You
>>> could support both behaviors by use of a property. Furthermore, you may
>> not
>>> want all threads stuck with stragglers and then you are back to square
>> one
>>> so you may need to stop processing stragglers beyond a certain thread
>> usage
>>> threshold. Is there a way to interrupt the processing of the engine?
>>> 
>>> Then there is the question of idempotency. I suspect it would be
>> difficult
>>> to maintain it unless you wait for processing to finish for all tuples
>>> received during the window every window boundary. You may provide an
>> option
>>> for relaxing the strict guarantees for the stragglers like mentioned
>> above.
>>> 
>>> Pramod
>>> 
>>> On Thu, Dec 14, 2017 at 10:49 AM, Ananth G <ananthg.a...@gmail.com>
>>> wrote:
>>>
>>>> [...]

Re: [Discuss] Design of the python execution operator

2017-12-14 Thread Ananth G
Hello Pramod,

Thanks for the comments. I adjusted the title of the JIRA. Here is what I was 
thinking for the worker pool implementation.

- The main reason ( which I forgot to mention in the design points below ) is 
that the Java embedded engine allows only the thread that created the 
interpreter instance to execute the Python logic. This is more because of the 
JNI specification itself. Some hints here: 
https://stackoverflow.com/questions/18056347/jni-calling-java-from-c-with-multiple-threads
and here: 
http://journals.ecs.soton.ac.uk/java/tutorial/native1.1/implementing/sync.html

- This essentially means that the main operator thread would have to call the 
Python code execution logic if the design were otherwise.

- Since the end user can choose to write any kind of logic, including 
blocking I/O, as part of the implementation, I did not want to stall the 
operator thread for any usage pattern.

- In fact there is only one overall interpreter in the JVM process space and 
the interpreter thread is just a JNI wrapper around it to account for the JNI 
limitations above.

- It is for the very same reason that there is an API in the implementation 
to support registering shared modules across all of the interpreter threads. 
Use cases for this exist when there is a global variable provided by the 
underlying Python library and loading it multiple times can cause issues. 
Hence the API to register a shared module which can be used by all of the 
interpreter threads.

- The operator submits to a request queue and consumes from a response queue 
for each interpreter thread. There exists one request and one response queue 
per interpreter thread ( see the sketch at the end of this mail ).

- The stragglers will get drained from the response queue of a previously 
submitted request.

- The other reason why I chose to implement it this way is for some of the 
use cases that I foresee in ML scoring scenarios. In fraud systems, if I have 
a strict SLA to score a model, the main thread in the operator does not help 
me implement this pattern at all. The caller to the Apex application will 
need to proceed if the scoring gets delayed for whatever reason. However, the 
scoring can continue on the interpreter thread and can be drained later ( it 
is just that the caller did not make use of this result, but it can still be 
persisted for operators consuming from the straggler port ).

- There are 3 output ports for this operator. DefaultOutputPort, stragglersPort 
and an errorPort. 

- Some libraries like TensorFlow can become really heavy. TensorFlow models 
can execute a TensorFlow DAG as part of a model scoring implementation, and 
hence I wanted to take the approach of a worker pool. Yes, your point is 
valid if we wait for the stragglers to complete in a given window. The 
current implementation does not force waiting for all of the stragglers to 
complete. The stragglers are emitted only when there is a new tuple being 
processed, i.e. when a new tuple arrives for scoring, the straggler response 
queue is checked for entries, and if there are any, the responses are 
emitted on the straggler port. This essentially means that there are 
situations when the straggler port is emitting the result of a request 
submitted in a previous window. This also implies that idempotency cannot be 
guaranteed across runs of the same input data. In fact, all threaded 
implementations have this issue, as the ordering of the results is not 
guaranteed to be unique even within a given window.

I can enforce a block/drain at the end of the window to force completion, 
based on the feedback.
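
To illustrate the thread-confinement and queueing points above, here is a
minimal sketch of one worker: the interpreter handle is created and used
only on the worker's own thread, with one request and one response queue
per worker. The Interpreter interface is a stand-in for the JEP wrapper,
not the actual operator code:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.function.Supplier;

    public class InterpreterWorkerSketch extends Thread {
        // One request and one response queue per interpreter thread.
        final BlockingQueue<String> requests = new ArrayBlockingQueue<>(1024);
        final BlockingQueue<Object> responses = new ArrayBlockingQueue<>(1024);

        // Stand-in for the JEP interpreter; JNI effectively requires that
        // the creating thread is the only one that ever calls into it.
        interface Interpreter extends AutoCloseable {
            Object eval(String expression) throws Exception;
        }

        private final Supplier<Interpreter> factory;

        InterpreterWorkerSketch(Supplier<Interpreter> factory) {
            this.factory = factory;
        }

        @Override
        public void run() {
            // Created here, on the worker thread, never on the operator thread.
            try (Interpreter interpreter = factory.get()) {
                while (!isInterrupted()) {
                    String expression = requests.take(); // or poll() + sleep/spin
                    responses.put(interpreter.eval(expression));
                }
            } catch (InterruptedException shutdown) {
                // operator shutdown; fall through and close the interpreter
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

The operator thread only ever touches the queues; a timeout on the response
side gives the SLA/straggler behaviour described above.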


Regards,
Ananth

> On 15 Dec 2017, at 4:21 am, Pramod Immaneni <pra...@datatorrent.com> wrote:
> 
> Hi Ananth,
> 
> Sounds interesting and looks like you have put quite a bit of work on it.
> Might I suggest changing the title of 2260 to better fit your proposal and
> implementation, mainly so that there is differentiation from 2261.
> 
> I wanted to discuss the proposal to use multiple threads in an operator
> instance. Unless the execution threads are blocking for some sort of i/o
> why would it result in a noticeable performance difference compared to
> processing in operator thread and running multiple partitions of the
> operator in container local. By running the processing in a separate thread
> from the operator lifecycle thread you still don't get away from matching
> the incoming data throughput. The checkpoint will act as the time where
> backpressure will start to materialize, when the operator would have to
> wait for your background processing to complete to guarantee all data till
> the checkpoint is processed.
> 
> Thanks
> 
> 
> On Thu, Dec 14, 2017 at 2:20 AM, Ananth G &l

[Discuss] Design of the python execution operator

2017-12-14 Thread Ananth G
Hello All,

I would like to submit the design for the Python execution operator before I 
raise the pull request, so that I can refine the implementation based on 
feedback. Could you please provide feedback on the design, if any, and I 
will raise the PR accordingly.

- This operator is for the JIRA ticket raised here 
https://issues.apache.org/jira/browse/APEXMALHAR-2260 

- The operator embeds a Python interpreter in the operator JVM process space 
and is not external to the JVM.
- The implementation proposes the use of Java Embedded Python ( JEP ), given 
here: https://github.com/ninia/jep
- The JEP engine is under the zlib/libpng license. Since this is an approved 
license under https://www.apache.org/legal/resolved.html#category-a, I am 
assuming it is ok for the community to approve the inclusion of this library.
- Python integration is a messy piece due to the nature of dynamic libraries. 
All Python libraries need to be natively installed. This also means we will 
not be able to bundle Python libraries and dependencies as part of the build 
into the target JVM container. Hence this operator has the current limitation 
that the Python binaries are installed through an external process on all of 
the YARN nodes for now.
- The JEP Maven dependency jar in the POM is a JNI wrapper around the dynamic 
library that is installed externally to the Apex installation process on all 
of the YARN nodes.
- I hope to take up https://issues.apache.org/jira/browse/APEXCORE-796 to 
solve this issue in the future.
- The Python operator implementation can be extended to a Py4J-based 
implementation ( as opposed to an in-memory model like JEP ) in the future if 
need be. JEP is the implementation based on an in-memory design pattern.
- The Python operator allows for 4 major API patterns ( see the sketch after 
this list ):
- Execute a method call by accepting parameters to pass to the interpreter
- Execute a Python script given by a file path
- Evaluate an expression, allowing for passing of variables between the 
Java code and the Python in-memory interpreter bridge
- A handy method wherein a series of instructions can be passed in one 
single Java call ( executed as a sequence of Python eval instructions under 
the hood )
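
For reference, the four patterns map roughly onto the embedded JEP calls as
below. This is a minimal sketch assuming the JEP 3.x API of the time
(jep.Jep with eval/invoke/runScript/set/getValue); the script path and the
Python snippets are made up, and the worker-pool and SLA machinery described
in this mail are deliberately left out:

    import jep.Jep;

    public class JepPatternsSketch {
        public static void main(String[] args) throws Exception {
            try (Jep jep = new Jep()) { // must stay on the creating thread
                // 1. Method call with parameters passed to the interpreter
                jep.eval("def multiply(a, b): return a * b");
                Object product = jep.invoke("multiply", 4, 5);

                // 2. Run a Python script given by a file path (illustrative)
                jep.runScript("/tmp/score_model.py");

                // 3. Evaluate an expression, passing variables across the bridge
                jep.set("x", 10);
                jep.eval("y = x * 2");
                Object y = jep.getValue("y");

                // 4. A series of instructions in one logical call
                //    ( a sequence of eval instructions under the hood )
                jep.eval("import math");
                jep.eval("z = math.sqrt(y)");
                System.out.println(product + " " + y);
            }
        }
    }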
- Automatic garbage collection of the variables that are passed from the Java 
code to the in-memory Python interpreter.
- Support for all major Python libraries: TensorFlow, Keras, scikit-learn, 
XGBoost. Preliminary tests for these libraries seem to work, as per the code 
here: 
https://github.com/ananthc/sampleapps/tree/master/apache-apex/apexjvmpython
- The implementation allows for an SLA-based execution model, i.e. the 
operator is given a chance to execute the Python code and, if it does not 
complete within a timeout, the operator code returns null.
- A tuple that has become a straggler as per the previous point will 
automatically be drained off to a different port, so that downstream 
operators can still consume the straggler if they want to when the results 
arrive.
- Because Python is an interpreter, if a previous tuple is still being 
processed, there is a chance of a back-pressure pattern building up very 
quickly. Hence this operator works on the concept of a worker pool. The 
Python operator uses a configurable number of worker threads, each of which 
embeds the Python interpreter within its processing space; i.e. it is in 
fact a collection of Python in-memory interpreters inside the Python 
operator implementation.
- The operator chooses one of the threads at runtime based on their busy 
state, thus allowing back-pressure issues to be resolved automatically.
- There is first-class support for NumPy in JEP. Java arrays are convertible 
to Python NumPy arrays and vice versa, and share the same memory addresses 
for efficiency reasons.
- The base operator implements dynamic partitioning based on a thread 
starvation policy. At each checkpoint, it checks what percentage of the 
requests resulted in starved threads, and if the starvation exceeds a 
configured percentage, a new instance of the operator is provisioned for 
every such instance of the operator.
- The operator provides the notion of a worker execution mode. There are two 
worker modes that are passed in each of the above calls from the user: ALL 
or ANY. Because the Python interpreter is a state-based engine, a newly 
dynamically partitioned operator might not be in the exact state of the 
remaining operators. Hence the operator has this notion of a worker 
execution mode. Any call ( any of the 4 calls mentioned above ) made with 
the ALL execution mode will be executed on all the workers of the worker 
thread pool as well as the dynamically partitioned instance whenever such an 

[ANNOUNCE] Apache Apex Malhar 3.8.0 released

2017-11-13 Thread Ananth G
Dear Community,

The Apache Apex community is pleased to announce release 3.8.0 of the Apex 
Malhar library.

Apache Apex is an enterprise grade big data-in-motion platform that unifies 
stream and batch processing. Apex was built for scalability and low-latency 
processing, high availability and operability. The Apex engine is supplemented 
by Malhar, the library of pre-built operators, including connectors that 
integrate with many existing technologies as sources and destinations, like 
message buses, databases, files or social media feeds.
Along with bug fixes, this release brings improved support for the Flume 
operator, bloom filter support for bucketing, better failure handling in the 
Kafka input operator, and a multi-table HBase output operator, among others.

New features include support for Kudu and support for sort accumulation in 
the Windowed Operator, besides enhancements to some example applications.

Changes: https://github.com/apache/apex-malhar/blob/v3.8.0/CHANGELOG.md 


The source release can be found at: 
http://www.apache.org/dyn/closer.lua/apex/apache-apex-malhar-3.8.0/apache-apex-malhar-3.8.0-source-release.tar.gz
 


or visit: http://apex.apache.org/downloads.html 


We welcome your help and feedback. For more information on the project and how 
to get involved, visit our website at: http://apex.apache.org/

Regards,
The Apache Apex community



[RESULT][VOTE] Apache Apex Malhar 3.8.0 release candidate RC1

2017-11-10 Thread Ananth G
The vote is concluded and passes. Thanks everyone for voting and verifying the 
release.

binding +1 (3)

Vlad Rozov
Thomas Weise
Tushar Gosavi

non-binding +1 (1)

Bhupesh Chawda

No other votes.

Will complete the release activities in 12 hours from now.

Regards,
Ananth

Re: [VOTE] Apache Apex Malhar 3.8.0 release candidate RC1

2017-11-09 Thread Ananth G
Hello All,

A gentle reminder: could you please vote on the release candidate so that we 
can proceed with the release?

Regards,
Ananth 
> On 8 Nov 2017, at 3:39 am, Thomas Weise <t...@apache.org> wrote:
> 
> +1 (binding)
> 
> - verified signatures
> - run pi demo
> 
> The release candidate should come with staged javadoc (it will be needed to
> update the download page). Please see recent release threads, the vote
> example in the release instructions should probably be updated.
> 
> Thanks,
> Thomas
> 
> 
> 
> On Sun, Nov 5, 2017 at 10:23 AM, Ananth G <ananthg.a...@gmail.com> wrote:
> 
>> [...]



Re: [VOTE] Apache Apex Malhar 3.8.0 release candidate RC1

2017-11-06 Thread Ananth G
Hello All,

I updated the signature file extensions appropriately. The only change is 
the upload of the SHA signature files with the new extension of "sha512" in 
the staging directory: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/

May I request everyone to vote on the release artefacts as mentioned in the 
mail thread below?

Some quick links from the original email below:

- List of all issues fixed: 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318824&version=12340282

- User documentation: http://apex.apache.org/docs/malhar-3.8

- Staging directory: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1

- Source zip: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/apache-apex-malhar-3.8.0-source-release.zip

- Source tar.gz: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/apache-apex-malhar-3.8.0-source-release.tar.gz

- Maven staging repository: 
https://repository.apache.org/content/repositories/orgapacheapex-1030

- Git source: https://github.com/apache/apex-malhar/tree/v3.8.0-RC1 (commit: 
552a0b7e50d1441db3e98104491f4abce65d506e)

- PGP key: 
http://pgp.mit.edu/pks/lookup?op=index&search=ananthg%40apache.org

- KEYS file: https://dist.apache.org/repos/dist/release/apex/KEYS



Regards,
Ananth


> On 6 Nov 2017, at 7:25 am, Ananth G <ananthg.a...@gmail.com> wrote:
>
> > [...]

Re: [VOTE] Apache Apex Malhar 3.8.0 release candidate RC1

2017-11-05 Thread Ananth G
Thanks for the note, Vlad. I will be able to push the additional .sha512 
extension tonight.

- Just to be clear, the release instructions do indicate a SHA-512 checksum 
as part of the instructions, but it is the filename that is not in 
agreement? ( The extension for the file ends in .sha instead of the 
recommended .sha512. )
- The link below from you suggests that a .sha1 extension is mandatory for 
historical reasons. Hence I will be creating this signature as well ( apart 
from the .sha512 extension ).
- I am assuming that a new release candidate (RC2) is not required, as only 
the checksums are being regenerated?

I guess I have a few minor updates to the guidelines document, for which I 
will create a separate pull request that will include the above as well.

I would be able to push the above changes tonight ( another 12 hours from 
now ).

Regards,
Ananth 

> On 6 Nov 2017, at 3:44 am, Vlad Rozov <vro...@apache.org> wrote:
> 
> Hi Ananth,
> 
> There is new requirement [1], [2] for Apache releases to use .sha512 
> extension that is not yet reflected by Apex release guidelines. Can you 
> please update extension.
> 
> Thank you,
> 
> Vlad
> 
> [1] http://www.apache.org/dev/release-distribution
> [2] http://www.apache.org/dev/release-distribution#sigs-and-sums
> 
> On 11/5/17 02:23, Ananth G wrote:
>> Dear Community,
>> 
>> Please vote on the following Apache Apex Malhar 3.8.0 release candidate.
>> 
>> This is a source release with binary artifacts published to Maven.
>> 
>> This release is based on Apex Core 3.6 and resolves 63 issues.
>> 
>> List of all issues fixed: 
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318824&version=12340282
>> User documentation: http://apex.apache.org/docs/malhar-3.8/ 
>> 
>> Staging directory: 
>> https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/ 
>> Source zip:
>> https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/apache-apex-malhar-3.8.0-source-release.zip
>> Source tar.gz:
>> https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/apache-apex-malhar-3.8.0-source-release.tar.gz
>> Maven staging repository:
>> https://repository.apache.org/content/repositories/orgapacheapex-1030/ 
>> 
>> Git source:
>> https://github.com/apache/apex-malhar/tree/v3.8.0-RC1 (commit: 
>> 552a0b7e50d1441db3e98104491f4abce65d506e)
>> 
>> PGP key:
>> http://pgp.mit.edu/pks/lookup?search=ananthg%40apache.org&op=index 
>> KEYS file:
>> https://dist.apache.org/repos/dist/release/apex/KEYS 
>> 
>> More information at:
>> http://apex.apache.org 
>> 
>> Please try the release and vote; vote will be open for at least 72 hours.
>> 
>> [ ] +1 approve (and what verification was done)
>> [ ] -1 disapprove (and reason why)
>> 
>> http://www.apache.org/foundation/voting.html 
>> 
>> How to verify release candidate:
>> 
>> http://apex.apache.org/verification.html 
>> 
>> Regards,
>> Ananth
> 



[VOTE] Apache Apex Malhar 3.8.0 release candidate RC1

2017-11-05 Thread Ananth G
Dear Community, 

Please vote on the following Apache Apex Malhar 3.8.0 release candidate. 

This is a source release with binary artifacts published to Maven. 

This release is based on Apex Core 3.6 and resolves 63 issues. 

List of all issues fixed: 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318824&version=12340282

User documentation: http://apex.apache.org/docs/malhar-3.8/ 


Staging directory: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/ 

Source zip: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/apache-apex-malhar-3.8.0-source-release.zip

Source tar.gz: 
https://dist.apache.org/repos/dist/dev/apex/apache-apex-malhar-3.8.0-RC1/apache-apex-malhar-3.8.0-source-release.tar.gz

Maven staging repository: 
https://repository.apache.org/content/repositories/orgapacheapex-1030/ 


Git source: 
https://github.com/apache/apex-malhar/tree/v3.8.0-RC1 (commit: 
552a0b7e50d1441db3e98104491f4abce65d506e) 

PGP key: 
http://pgp.mit.edu/pks/lookup?search=ananthg%40apache.org&op=index 

KEYS file: 
https://dist.apache.org/repos/dist/release/apex/KEYS 


More information at: 
http://apex.apache.org 

Please try the release and vote; vote will be open for at least 72 hours. 

[ ] +1 approve (and what verification was done) 
[ ] -1 disapprove (and reason why) 

http://www.apache.org/foundation/voting.html 


How to verify release candidate: 

http://apex.apache.org/verification.html 


Regards,
Ananth

Re: Process to push the documentation into ASF-site before vote for a release

2017-11-05 Thread Ananth G
Thanks Thomas. 

Regards,
Ananth.

> On 5 Nov 2017, at 7:47 pm, Thomas Weise <t...@apache.org> wrote:
> 
> Done: http://apex.apache.org/docs/malhar-3.8/ 
> 
> The docs just were not published (see link below the instructions that you
> executed).
> 
> 
> On Sun, Nov 5, 2017 at 3:20 AM, Ananth G <ananthg.a...@gmail.com 
> <mailto:ananthg.a...@gmail.com>> wrote:
> 
>> Hello All,
>> 
>> I am having an issue completing the last step of the release process
>> before a vote can be called for. My understanding is that the vote for a
>> release ( malhar release 3.8.0 ) needs to provide a link for the
>> documentation as well. Hence I am trying to generate the doc links before a
>> vote email can be sent. Could anyone of you please guide me as to what is
>> the right way to push documentation?
>> 
>> The question I have is if the steps to update apex-site docs folder as
>> mentioned in the release process are sufficient as they stand today? I
>> copied the malhar build generated docs into the apex-site docs folder ( as
>> the steps in documentation suggest) and pushed the changes to
>> apache/apex-site repo with asf-site as the branch. However, the docs are
>> still not visible in the browser at the correct release URL, such as:
>> http://apex.apache.org/docs/malhar-3.8/
>> 
>> I am wondering if there is any other step that I missed ?
>> 
>> Regards,
>> Ananth



Process to push the documentation into ASF-site before vote for a release

2017-11-04 Thread Ananth G
Hello All,

I am having an issue completing the last step of the release process before a 
vote can be called for. My understanding is that the vote for a release ( 
malhar release 3.8.0 ) needs to provide a link for the documentation as well. 
Hence I am trying to generate the doc links before a vote email can be sent. 
Could anyone of you please guide me as to what is the right way to push 
documentation? 

The question I have is if the steps to update apex-site docs folder as 
mentioned in the release process are sufficient as they stand today? I copied 
the malhar build generated docs into the apex-site docs folder ( as the steps 
in documentation suggest) and pushed the changes to apache/apex-site repo with 
asf-site as the branch. However, the docs are still not visible in the browser 
at the correct release URL, such as: 
http://apex.apache.org/docs/malhar-3.8/ 

I am wondering if there is any other step that I missed ? 

Regards,
Ananth

Re: Malhar release 3.8.0

2017-11-04 Thread Ananth G
Just a note: Since all of the open JIRAs are resolved, I am proceeding to 
follow the steps in the release guidelines for the 3.8.0 release. 

I will send a note once the artefacts are ready for a voting process.

Regards,
Ananth

> On 4 Nov 2017, at 5:39 am, Vlad Rozov  wrote:
> 
> Sorry, I don't fully understand your point. Is the suggestion not to add the 
> maven plugin and rely only on a manual sanity check, or to do an extra manual 
> sanity check on top of the plugin? Any reason not to go with the second approach?
> 
> Thank you,
> 
> Vlad
> 
> On 11/3/17 08:36, Justin Mclean wrote:
>> Hi,
>> 
>>> The JIRA mentioned is to unblock the release. IMO the license plugin should
>>> be taken up as separate activity.
>> FYI - in my experience the license plugin is not going to catch everything 
>> and it may produce false positives. The best way is to manually look at the 
>> release artefact and its dependencies.
>> 
>> Thanks,
>> Justin
> 



Re: [ANNOUNCE] New Apache Apex PMC: Tushar Gosavi

2017-11-04 Thread Ananth G
Congrats Tushar.

Regards,
Ananth
> On 4 Nov 2017, at 7:55 pm, vikram patil  wrote:
> 
> Congratulations Tushar !!
> 
> 
> 
> On Fri, Nov 3, 2017 at 11:57 PM, Ambarish Pande
>  wrote:
>> Congratulations Tushar!
>> 
>> On Fri, 3 Nov 2017 at 11:48 PM, Sanjay Pujare 
>> wrote:
>> 
>>> Congratulations, Tushar!
>>> 
>>> Sanjay
>>> 
>>> 
>>> On Fri, Nov 3, 2017 at 9:17 AM, Pramod Immaneni 
>>> wrote:
>>> 
 A bit delayed but nevertheless important announcement, Apache Apex PMC is
 pleased to announce Tushar Gosavi as a new PMC member.
 
 Tushar has been contributing to Apex from the beginning of the project
>>> and
 has been working on the codebase for over 3 years. He is among the few
>>> who
 have a wide breadth of contribution, including both core and malhar, from
 internal changes to user facing api, from input/output operators to
 components that support operators, and has a good overall understanding
>>> of
 the codebase and how it works.
 
 His salient contributions over the years are
 
   - Module support in Apex
   - Operator additions and improvements such as S3, File input and
>>> output,
   partitionable unique count and dynamic partitioning improvements
   - Initial WAL implementation from which subsequent implementations
>>> were
   derived for different use cases
   - Plugin support for Apex
   - Various bug fixes and improvements in both malhar and core that you
   can find in the JIRA
   - Participated in long-term project maintenance tasks such as
   refactoring operators and demos
   - Participated in important feature discussions
   - Reviewed and committed pull requests from contributors
   - Participated in conducting and teaching in an Apex workshop at a
   university and speaking at Apex conference organized by DataTorrent
 
 Conference talks & Presentations
 
   1. Presentations at VIIT and PICT Pune
   2. http://www.apexbigdata.com/pune-platform-talk-9.html
   3. http://www.apexbigdata.com/pune-integration-talk-4.html
   4. Webinar on Smart Partitioning with Apex. (
   https://www.youtube.com/watch?v=HCATB1zlLE4
   )
   5. Presented about customer use case at Pune Meetup in 2016
 
 Pramod for the Apache Apex PMC.
 
>>> 
>> --
>> 
>> ___
>> 
>> Ambarish Pande
>> 
>> Associate Software Engineer
>> 
>> E: ambar...@datatorrent.com | M: +91-9028293982
>> 
>> www.datatorrent.com  |  apex.apache.org



Re: [ANNOUNCE] New Apache Apex Committer: Ananth Gundabattula

2017-11-04 Thread Ananth G
Thanks all.


Regards,
Ananth
> On 4 Nov 2017, at 7:54 pm, vikram patil  wrote:
> 
> Congratulations Ananth.
> 
> 
> On Fri, Nov 3, 2017 at 11:50 PM, Sanjay Pujare  wrote:
>> Congratulations, Ananth
>> 
>> Sanjay
>> 
>> 
>> On Fri, Nov 3, 2017 at 1:50 AM, Thomas Weise  wrote:
>> 
>>> The Project Management Committee (PMC) for Apache Apex is pleased to
>>> announce Ananth Gundabattula as new committer.
>>> 
>>> Ananth has been contributing to the project for about a year. Highlights:
>>> 
>>> * Cassandra and Kudu operators with in-depth analysis/design work
>>> * Good collaboration, adherence to contributor guidelines and ownership of
>>> work
>>> * Work beyond feature focus such as fixing pre-existing test issues that
>>> impact CI
>>> * Presented at YOW Data and Dataworks Summit Australia
>>> * Enthusiast, contributes on his own time
>>> 
>>> Welcome, Ananth, and congratulations!
>>> Thomas, for the Apache Apex PMC.
>>> 



Re: checking dependencies for known vulnerabilities

2017-11-02 Thread Ananth G
My vote would be to break the build. This can then force a “whitelisting” 
configuration in the maven plugin to be created as part of the review process ( 
along with a new JIRA ticket ). 

The concern would then be to ensure that the community works towards a 
resolution of the JIRA. Not breaking the build is, for me, tech debt slipping 
in without anyone's notice. Fixing the JIRA is a hygiene process which I 
believe cannot be put on the back burner in the name of contributor welfare, 
and would need commitments from the contributor ( and/or others ). 

On a related note, looking at Apache Spark, there seem to be CVE listings 
which the Spark community has taken care of as the releases progressed: 
http://www.cvedetails.com/vulnerability-list/vendor_id-45/product_id-38954/Apache-Spark.html

Regards,
Ananth


> On 3 Nov 2017, at 4:48 am, Sanjay Pujare  wrote:
> 
> I like this suggestion. Blocking the PR is too drastic. I also second
> Pramod's point (made elsewhere) that we should try to encourage
> contribution instead of discouraging it by resorting to drastic measures.
> If you institute drastic measures to achieve a desired effect (e.g. getting
> contributors to look into CVEs and build infrastructure issues) it can have
> the opposite effect of contributors losing interest.
> 
> Sanjay
> 
> 
> 
> On Wed, Nov 1, 2017 at 1:25 PM, Thomas Weise  wrote:
> 
>> Considering typical behavior, unless the CI build fails, very few will be
>> interested fixing the issues.
>> 
>> Perhaps if after a CI failure the issue can be identified as pre-existing,
>> we can whitelist and create a JIRA that must be addressed prior to the next
>> release?
>> 
>> 
>> On Wed, Nov 1, 2017 at 7:51 PM, Pramod Immaneni 
>> wrote:
>> 
>>> I would like to hear what others think.. at this point I am -1 on merging
>>> the change as is that would fail all PR builds when a matching CVE is
>>> discovered regardless of whether the PR was the cause of the CVE or not.
>>> 
>>> On Wed, Nov 1, 2017 at 12:07 PM, Vlad Rozov  wrote:
>>> 
 On 11/1/17 11:39, Pramod Immaneni wrote:
 
> On Wed, Nov 1, 2017 at 11:36 AM, Vlad Rozov 
>> wrote:
> 
> There is no independent build and the check is still necessary to
>>> prevent
>> new dependencies with CVE being introduced.
>> 
>> There isn't one today but one could be added. What kind of effort is
> needed.
> 
 After it is added, we can discuss whether it will make sense to move
>> the
 check to the newly created build. Even if one is added, the check needs
>>> to
 be present in the CI builds that verify PR, so it is in the right place
 already, IMO.
 
> 
> 
> Look at Malhar 3.8.0 thread. There are libraries from Category X
>> introduced as a dependency, so now instead of dealing with the issue
>>> when
>> such dependencies were introduced, somebody else needs to deal with
>> removing/fixing those dependencies.
>> 
>> Those were directly introduced in PRs. I am not against adding
>>> additional
> checks that verify the PR better.
> 
 Right and it would be much better to catch the problem at the time it
>> was
 introduced, but Category X list (as well as known CVE) is not static.
 
 
> 
> Thank you,
>> 
>> Vlad
>> 
>> 
>> On 11/1/17 11:21, Pramod Immaneni wrote:
>> 
>> My original concern still remains. I think what you have is valuable
>>> but
>>> would prefer that it be activated in an independent build that
>>> notifies
>>> the
>>> interested parties.
>>> 
>>> On Wed, Nov 1, 2017 at 11:13 AM, Vlad Rozov 
>>> wrote:
>>> 
>>> Any other concerns regarding merging the PR? By looking at the
>> active
>>> PRs
>>> 
 on the apex core the entire conversation looks to be at the moot
>>> point.
 
 Thank you,
 
 Vlad
 
 
 On 10/30/17 18:50, Vlad Rozov wrote:
 
 On 10/30/17 17:30, Pramod Immaneni wrote:
 
> On Sat, Oct 28, 2017 at 7:47 AM, Vlad Rozov 
> wrote:
> 
>> Don't we use unit test to make sure that PR does not break an
>> existing
>> 
>> functionality? For that we use CI environment that we do not
>>> control
>>> and do
>>> not introduce any changes to, but for example Apache
>>> infrastructure
>>> team
>>> may decide to upgrade Jenkins and that may impact Apex builds.
>> The
>>> same
>>> applies to CVE. It is there to prevent dependencies with severe
>>> vulnerabilities.
>>> 
>>> Infrastructure changes are quite different, IMO, from this
>>> proposal.
>>> 
>>> While

Re: Malhar release 3.8.0

2017-11-02 Thread Ananth G
May I also get some thoughts on the approach of using the maven license plugin, 
as documented in the JIRA here: 
https://issues.apache.org/jira/browse/APEXMALHAR-2461 

The TL;DR version is that the build can be made to break if a certain license 
shows up in the build path. However, there are other issues that need to be 
considered before the plugin can become part of the PR process. Since embracing 
the plugin is a huge manual check effort for the first-time integration, I 
would like to get some thoughts before I spend too much time on this approach 
and realise later that we do not all agree with it. 


Regards,
Ananth 
> On 3 Nov 2017, at 5:57 am, Ananth G <ananthg.a...@gmail.com> wrote:
> 
> The following changes pass the build and tests in my local environment:
> 
> - Fixing the JSON schema validator with the apache version
> - Fixing the JSON core with the apache version 
> - Marking mysql connector jars to be optional 
> 
> @Pramod: The plugin allows for custom overriding for the use cases you 
> mentioned. For example, some jars do not include the license and hence the 
> license is null ( the apex jar is also in this category ). Some other 
> dependencies do not seem to have an aligned license name, e.g. "Apache 
> License 2.0" vs "apache 2". I am currently exploring how far I can go by 
> adding “excludes” as we encounter such blocks. 
> 
> Regards,
> Ananth 
> 
> 
>> On 2 Nov 2017, at 7:44 am, Vlad Rozov <vro...@apache.org> wrote:
>> 
>> I would agree to consider demos/samples as optional components especially 
>> that they are not published to the Apache maven repository as Apex 
>> artifacts, but still would prefer that all libraries licensed under Category 
>> X are marked as optional in those modules.
>> 
>> For all artifacts that are published to maven including sql and contrib, the 
>> dependencies must be removed, upgraded/replaced or marked (as a last resort) 
>> as optional.
>> 
>> Thank you,
>> 
>> Vlad
>> 
>> On 11/1/17 13:06, Pramod Immaneni wrote:
>>> If we go by the strict definition of [1], I guess everything can be
>>> considered optional because each component in the library is pretty much
>>> independent. But if we look at it another way, the project can
>>> consider certain modules as important and main and others as optional. I
>>> think the pom.xml profile distinction shows one such intent. Like I said,
>>> the intent should be to fix the ones you identified but if it were not
>>> possible, I think modules like examples, contrib and other in the
>>> all-modules and not in the main profile could be considered optional for
>>> the licensing purposes.
>>> 
>>> 1.https://www.apache.org/legal/resolved.html#optional
>>> 
>>> On Wed, Nov 1, 2017 at 12:57 PM, Vlad Rozov <vro...@apache.org> wrote:
>>> 
>>>> IMO, it is only usage that may be optional, whether a module is included
>>>> into a profile that is not enabled by default does not define it's usage.
>>>> 
>>>> It is also possible to consider entire library as optional, somebody may
>>>> use only one operator from the entire library.
>>>> 
>>>> Thank you,
>>>> 
>>>> Vlad
>>>> 
>>>> 
>>>> On 11/1/17 12:49, Pramod Immaneni wrote:
>>>> 
>>>>> I was thinking of it more in terms of optional that Justin mentioned
>>>>> earlier.
>>>>> 
>>>>> On Wed, Nov 1, 2017 at 12:10 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>>> 
>>>>> It does not matter whether sql (and demos) is part of the main profile or
>>>>>> not. It is a source release, not a binary release and source includes all
>>>>>> profiles.
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> Vlad
>>>>>> 
>>>>>> 
>>>>>> On 11/1/17 11:50, Pramod Immaneni wrote:
>>>>>> 
>>>>>> Vlad can you add this command to the release instructions and the
>>>>>>> committer
>>>>>>> guidelines. If we are unable to address this for this release, we can
>>>>>>> consider moving examples to all-modules, sql is already not in the main
>>>>>>> profile.
>>>>>>> 
>>>>>>> On Mon, Oct 30, 2017 at 7:23 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>&g

Re: Malhar release 3.8.0

2017-11-02 Thread Ananth G
The following changes pass the build and tests in my local environment:

 - Fixing the JSON schema validator with the apache version
 - Fixing the JSON core with the apache version 
 - Marking mysql connector jars to be optional 

@Pramod: The plugin allows for custom overriding for the use cases you 
mentioned. For example, some jars do not include the license and hence the 
license is null ( the apex jar is also in this category ). Some other 
dependencies do not seem to have an aligned license name, e.g. "Apache License 
2.0" vs "apache 2". I am currently exploring how far I can go by adding 
“excludes” as we encounter such blocks; a sketch of the underlying 
normalization problem follows below. 
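
To illustrate the normalization problem just described, here is a small, 
self-contained sketch. The mappings and whitelist entries are purely 
hypothetical and are not how the license-maven-plugin itself behaves; the 
point is only that raw license strings need canonicalizing before any 
whitelist comparison:

// Hedged sketch: canonicalize raw license strings found in dependency
// metadata ("Apache License 2.0", "apache 2", null, ...) before checking
// them against a whitelist. All mappings here are illustrative only.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LicenseWhitelist {
  private static final Set<String> ALLOWED =
      new HashSet<>(Arrays.asList("apache-2.0", "mit", "bsd"));

  static String canonicalize(String raw) {
    if (raw == null) {
      return "unknown"; // jars that ship no license metadata at all
    }
    String s = raw.toLowerCase().trim();
    if (s.contains("apache")) {
      return "apache-2.0"; // matches "Apache License 2.0", "apache 2", ...
    }
    if (s.contains("mit")) {
      return "mit";
    }
    if (s.contains("bsd")) {
      return "bsd";
    }
    return s;
  }

  public static boolean isAllowed(String rawLicenseName) {
    return ALLOWED.contains(canonicalize(rawLicenseName));
  }
}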

Regards,
Ananth 


> On 2 Nov 2017, at 7:44 am, Vlad Rozov <vro...@apache.org> wrote:
> 
> I would agree to consider demos/samples as optional components especially 
> that they are not published to the Apache maven repository as Apex artifacts, 
> but still would prefer that all libraries licensed under Category X are 
> marked as optional in those modules.
> 
> For all artifacts that are published to maven including sql and contrib, the 
> dependencies must be removed, upgraded/replaced or marked (as a last resort) 
> as optional.
> 
> Thank you,
> 
> Vlad
> 
> On 11/1/17 13:06, Pramod Immaneni wrote:
>> If we go by the strict definition of [1], I guess everything can be
>> considered optional because each component in the library is pretty much
>> independent. But if we look at it another way, the project can
>> consider certain modules as important and main and others as optional. I
>> think the pom.xml profile distinction shows one such intent. Like I said,
>> the intent should be to fix the ones you identified but if it were not
>> possible, I think modules like examples, contrib and other in the
>> all-modules and not in the main profile could be considered optional for
>> the licensing purposes.
>> 
>> 1.https://www.apache.org/legal/resolved.html#optional
>> 
>> On Wed, Nov 1, 2017 at 12:57 PM, Vlad Rozov <vro...@apache.org> wrote:
>> 
>>> IMO, it is only usage that may be optional, whether a module is included
>>> into a profile that is not enabled by default does not define it's usage.
>>> 
>>> It is also possible to consider entire library as optional, somebody may
>>> use only one operator from the entire library.
>>> 
>>> Thank you,
>>> 
>>> Vlad
>>> 
>>> 
>>> On 11/1/17 12:49, Pramod Immaneni wrote:
>>> 
>>>> I was thinking of it more in terms of optional that Justin mentioned
>>>> earlier.
>>>> 
>>>> On Wed, Nov 1, 2017 at 12:10 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>> 
>>>> It does not matter whether sql (and demos) is part of the main profile or
>>>>> not. It is a source release, not a binary release and source includes all
>>>>> profiles.
>>>>> 
>>>>> Thank you,
>>>>> 
>>>>> Vlad
>>>>> 
>>>>> 
>>>>> On 11/1/17 11:50, Pramod Immaneni wrote:
>>>>> 
>>>>> Vlad can you add this command to the release instructions and the
>>>>>> committer
>>>>>> guidelines. If we are unable to address this for this release, we can
>>>>>> consider moving examples to all-modules, sql is already not in the main
>>>>>> profile.
>>>>>> 
>>>>>> On Mon, Oct 30, 2017 at 7:23 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>>>> 
>>>>>> The following command may help to identify dependencies:
>>>>>> 
>>>>>>> find . -name DEPENDENCIES -print | xargs grep -n License: | grep -vE
>>>>>>> "Apache|CDDL|MIT|BSD|ASF|Public Domain|Eclipse Public License|Mozilla
>>>>>>> Public|Common Public|apache.org"
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> 
>>>>>>> Vlad
>>>>>>> 
>>>>>>> On 10/28/17 20:19, Ananth G wrote:
>>>>>>> 
>>>>>>> Before we proceed with the release, could I please get some thoughts on
>>>>>>> 
>>>>>>>> the following JIRAs that need resolution. If we can move some of these
>>>>>>>> out
>>>>>>>> of 3.8.0 to the next release , then I can proceed with the release
>>>>>>>> instructions.
>>>>>>>> 
>>>>>>>> There are two JIRAs

Re: Malhar release 3.8.0

2017-11-01 Thread Ananth G
I was wondering if we can consider a maven plugin, or some other automated 
approach, that might help us avoid the current situation we have with respect 
to Category X licenses. 


What are our thoughts on :

- Integrating with VersionEye or an equivalent stack, wherein we use a 
VersionEye maven plugin to check for whitelisted / blacklisted licenses 
maintained on a VersionEye server 
- If the answer is yes, is there an ASF server that we can use for our builds ? 
- There seems to be a maven plugin that might do this but I have not used it 
before. Does anyone have any opinion on https://github.com/mrice/license-check ? 

Also what is our policy for undeclared licenses in the dependencies ? There is 
the license-maven-plugin from codehaus that lists these as “THIRD-PARTY” 
dependencies and can generate a report ( but cannot be used to break a build if 
a Category X license is introduced in a PR ).


Regards,
Ananth 


> On 2 Nov 2017, at 6:10 am, Vlad Rozov <vro...@apache.org> wrote:
> 
> It does not matter whether sql (and demos) is part of the main profile or 
> not. It is a source release, not a binary release and source includes all 
> profiles.
> 
> Thank you,
> 
> Vlad
> 
> On 11/1/17 11:50, Pramod Immaneni wrote:
>> Vlad can you add this command to the release instructions and the committer
>> guidelines. If we are unable to address this for this release, we can
>> consider moving examples to all-modules, sql is already not in the main
>> profile.
>> 
>> On Mon, Oct 30, 2017 at 7:23 PM, Vlad Rozov <vro...@apache.org> wrote:
>> 
>>> The following command may help to identify dependencies:
>>> 
>>> find . -name DEPENDENCIES -print | xargs grep -n License: | grep -vE
>>> "Apache|CDDL|MIT|BSD|ASF|Public Domain|Eclipse Public License|Mozilla
>>> Public|Common Public|apache.org"
>>> 
>>> Thank you,
>>> 
>>> Vlad
>>> 
>>> On 10/28/17 20:19, Ananth G wrote:
>>> 
>>>> Before we proceed with the release, could I please get some thoughts on
>>>> the following JIRAs that need resolution. If we can move some of these out
>>>> of 3.8.0 to the next release , then I can proceed with the release
>>>> instructions.
>>>> 
>>>> There are two JIRAs that are marked 3.8.0 and not yet resolved:
>>>> 
>>>> - https://issues.apache.org/jira/browse/APEXMALHAR-2461 (This is the one
>>>> that Vlad raised below about Category X dependencies )
>>>> - https://issues.apache.org/jira/browse/APEXMALHAR-2498 (Kafka Tests
>>>> being flaky )
>>>> 
>>>> The following is marked as “In progress”
>>>> - https://issues.apache.org/jira/browse/APEXMALHAR-2462 : This I believe
>>>> was kept in progress by Thomas for some follow up tasks and hence I believe
>>>> we can move it to post 3.8.0 release ?
>>>> 
>>>> 
>>>> @Vlad: Regarding APEXMALHAR-2461, How are you generating the license
>>>> reports ? I have tried using license-maven-plugin ( from codehaus ) and it
>>>> does generate a report but there is nothing which provides a report based
>>>> on the violations ( and hence being forced to open each project under
>>>> examples and comparing it with the licenses list from the allowed licenses
>>>> link that you provided in the mailing list a few days back). Is there a
>>>> more optimal way to see the current list of violations in a concise way ?
>>>> 
>>>> Regards,
>>>> Ananth
>>>> 
>>>> On 27 Oct 2017, at 5:53 am, Tushar Gosavi <tus...@datatorrent.com> wrote:
>>>>> Hi Vlad,
>>>>> 
>>>>> As far as I remember, I had access to staging maven area while doing
>>>>> previous apex release. You will need to update .m2/settings.xml with
>>>>> apache
>>>>> credential to access the maven repository.
>>>>> 
>>>>> Regards,
>>>>> -Tushar.
>>>>> 
>>>>> 
>>>>> On Thu, Oct 26, 2017 at 11:02 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>>> 
>>>>> Please send your PGP public key to one of PMC members to be added to
>>>>>> KEYS.
>>>>>> I don't 

Re: Malhar release 3.8.0

2017-10-28 Thread Ananth G

Before we proceed with the release, could I please get some thoughts on the 
following JIRAs that need resolution. If we can move some of these out of 3.8.0 
to the next release , then I can proceed with the release instructions. 

There are two JIRAs that are marked 3.8.0 and not yet resolved:

- https://issues.apache.org/jira/browse/APEXMALHAR-2461 (This is the one that 
Vlad raised below about Category X dependencies ) 
- https://issues.apache.org/jira/browse/APEXMALHAR-2498 (Kafka Tests being 
flaky ) 

The following is marked as “In progress” 
- https://issues.apache.org/jira/browse/APEXMALHAR-2462 : This I believe was 
kept in progress by Thomas for some follow up tasks and hence I believe we can 
move it to post 3.8.0 release ? 


@Vlad: Regarding APEXMALHAR-2461, How are you generating the license reports ? 
I have tried using license-maven-plugin ( from codehaus ) and it does generate 
a report but there is nothing which provides a report based on the violations ( 
and hence being forced to open each project under examples and comparing it 
with the licenses list from the allowed licenses link that you provided in the 
mailing list a few days back). Is there a more optimal way to see the current 
list of violations in a concise way ? 

Regards,
Ananth

> On 27 Oct 2017, at 5:53 am, Tushar Gosavi <tus...@datatorrent.com> wrote:
> 
> Hi Vlad,
> 
> As far as I remember, I had access to staging maven area while doing
> previous apex release. You will need to update .m2/settings.xml with apache
> credential to access the maven repository.
> 
> Regards,
> -Tushar.
> 
> 
> On Thu, Oct 26, 2017 at 11:02 PM, Vlad Rozov <vro...@apache.org> wrote:
> 
>> Please send your PGP public key to one of PMC members to be added to KEYS.
>> I don't remember if only PMC have access to staging Apache maven, it may be
>> the case. Tushar, did you have write access to the staging Apache maven
>> when you did the release?
>> 
>> What do we do with https://issues.apache.org/jira/browse/APEXMALHAR-2461?
>> 
>> Thank you,
>> 
>> Vlad
>> 
>> 
>> On 10/25/17 15:28, Ananth G wrote:
>> 
>>> I would like to volunteer to be the release manager for this. Given I
>>> have not done this before I might have a few questions along the way in the
>>> mailing list.
>>> 
>>> A couple of questions regarding the release process:
>>> 
>>> - In the link https://apex.apache.org/release.html , in the section
>>> titled “Build and deploy release candidate” there is a mention of adding
>>> GPG keys.
>>> - Is it mandatory for the release manager gpg public key to be
>>> present in the list
>>> - If it is how do I get my key added to that list
>>> - In the same section of the above link there is a mention of configuring
>>> the server apache.staging.https in the maven settings file.
>>> - I am not able to reach this server ? Is this expected?
>>> - The userid and password to be configured are our committer ids
>>> ?
>>> 
>>> Regards
>>> Ananth
>>> 
>>> On 26 Oct 2017, at 4:04 am, Ananth G <ananthg.a...@gmail.com> wrote:
>>>> 
>>>> +1 for malhar release.
>>>> 
>>>> 
>>>> Regards,
>>>> Ananth
>>>> 
>>>> On 26 Oct 2017, at 3:20 am, Bhupesh Chawda <bhup...@datatorrent.com>
>>>>> wrote:
>>>>> 
>>>>> +1 for malhar release
>>>>> 
>>>>> ~ Bhupesh
>>>>> 
>>>>> 
>>>>> ___
>>>>> 
>>>>> Bhupesh Chawda
>>>>> 
>>>>> E: bhup...@datatorrent.com | Twitter: @bhupeshsc
>>>>> 
>>>>> www.datatorrent.com  |  apex.apache.org
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Oct 25, 2017 at 9:37 PM, Chinmay Kolhatkar <
>>>>> chin...@datatorrent.com>
>>>>> wrote:
>>>>> 
>>>>> +1.
>>>>>> 
>>>>>> - Chinmay.
>>>>>> 
>>>>>> On 25 Oct 2017 9:20 pm, "Chaitanya Chebolu" <chaita...@datatorrent.com
>>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>> +1 on new release.
>>>>>>> 
>

Re: Malhar release 3.8.0

2017-10-25 Thread Ananth G
I would like to volunteer to be the release manager for this. Given I have not 
done this before I might have a few questions along the way in the mailing list.

A couple of questions regarding the release process:

- In the link https://apex.apache.org/release.html , in the section titled 
“Build and deploy release candidate” there is a mention of adding GPG keys.
    - Is it mandatory for the release manager's gpg public key to be present 
in the list?
    - If it is, how do I get my key added to that list?
- In the same section of the above link there is a mention of configuring the 
server apache.staging.https in the maven settings file. 
    - I am not able to reach this server; is this expected?
    - The userid and password to be configured are our committer ids?

Regards
Ananth

> On 26 Oct 2017, at 4:04 am, Ananth G <ananthg.a...@gmail.com> wrote:
> 
> +1 for malhar release. 
> 
> 
> Regards,
> Ananth
> 
>> On 26 Oct 2017, at 3:20 am, Bhupesh Chawda <bhup...@datatorrent.com> wrote:
>> 
>> +1 for malhar release
>> 
>> ~ Bhupesh
>> 
>> 
>> ___
>> 
>> Bhupesh Chawda
>> 
>> E: bhup...@datatorrent.com | Twitter: @bhupeshsc
>> 
>> www.datatorrent.com  |  apex.apache.org
>> 
>> 
>> 
>> On Wed, Oct 25, 2017 at 9:37 PM, Chinmay Kolhatkar <chin...@datatorrent.com>
>> wrote:
>> 
>>> +1.
>>> 
>>> - Chinmay.
>>> 
>>> On 25 Oct 2017 9:20 pm, "Chaitanya Chebolu" <chaita...@datatorrent.com>
>>> wrote:
>>> 
>>>> +1 on new release.
>>>> 
>>>> Thanks,
>>>> 
>>>>> On Wed, Oct 25, 2017 at 9:09 PM, Vlad Rozov <vro...@apache.org> wrote:
>>>>> 
>>>>> +1.
>>>>> 
>>>>> Thank you,
>>>>> 
>>>>> Vlad
>>>>> 
>>>>> 
>>>>>> On 10/25/17 08:21, Amol Kekre wrote:
>>>>>> 
>>>>>> +1 on a new malhar release.
>>>>>> 
>>>>>> Thks,
>>>>>> Amol
>>>>>> 
>>>>>> 
>>>>>> E:a...@datatorrent.com | M: 510-449-2606 | Twitter: @*amolhkekre*
>>>>>> 
>>>>>> www.datatorrent.com
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 24, 2017 at 9:12 PM, Tushar Gosavi <
>>> tus...@datatorrent.com>
>>>>>> wrote:
>>>>>> 
>>>>>> +1 on creating a new malhar release.
>>>>>>> 
>>>>>>> - Tushar.
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Oct 25, 2017 at 4:39 AM, Pramod Immaneni <
>>>> pra...@datatorrent.com
>>>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> +1 on creating a new release. I, unfortunately, do not have the time
>>>>>>>> currently to participate in the release activities.
>>>>>>>> 
>>>>>>>> On Mon, Oct 23, 2017 at 7:15 PM, Thomas Weise <t...@apache.org>
>>> wrote:
>>>>>>>> 
>>>>>>>> The last release was back in March, there are quite a few JIRAs that
>>>>>>>>> 
>>>>>>>> have
>>>>>>> 
>>>>>>>> been completed since and should be released.
>>>>>>>>> 
>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%
>>>>>>>>> 20%3D%203.8.0%20AND%20project%20%3D%20APEXMALHAR%20ORDER%
>>>>>>>>> 20BY%20status%20ASC
>>>>>>>>> 
>>>>>>>>> From looking at the list there is nothing that should stand in the
>>>> way
>>>>>>>>> 
>>>>>>>> of a
>>>>>>>> 
>>>>>>>>> release?
>>>>>>>>> 
>>>>>>>>> Also, once the release is out it would be a good opportunity to
>>>> effect
>>>>>>>>> 
>>>>>>>> the
>>>>>>>> 
>>>>>>>>> major version change.
>>>>>>>>> 
>>>>>>>>> Anyone interested to be the release manager?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Thomas
>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> 
>>>> *Chaitanya*
>>>> 
>>>> Software Engineer
>>>> 
>>>> E: chaita...@datatorrent.com | Twitter: @chaithu1403
>>>> 
>>>> www.datatorrent.com  |  apex.apache.org
>>>> 
>>> 
> 


Re: Malhar release 3.8.0

2017-10-25 Thread Ananth G
+1 for malhar release. 


Regards,
Ananth

> On 26 Oct 2017, at 3:20 am, Bhupesh Chawda  wrote:
> 
> +1 for malhar release
> 
> ~ Bhupesh
> 
> 
> ___
> 
> Bhupesh Chawda
> 
> E: bhup...@datatorrent.com | Twitter: @bhupeshsc
> 
> www.datatorrent.com  |  apex.apache.org
> 
> 
> 
> On Wed, Oct 25, 2017 at 9:37 PM, Chinmay Kolhatkar 
> wrote:
> 
>> +1.
>> 
>> - Chinmay.
>> 
>> On 25 Oct 2017 9:20 pm, "Chaitanya Chebolu" 
>> wrote:
>> 
>>> +1 on new release.
>>> 
>>> Thanks,
>>> 
>>> On Wed, Oct 25, 2017 at 9:09 PM, Vlad Rozov  wrote:
>>> 
 +1.
 
 Thank you,
 
 Vlad
 
 
 On 10/25/17 08:21, Amol Kekre wrote:
 
> +1 on a new malhar release.
> 
> Thks,
> Amol
> 
> 
> E:a...@datatorrent.com | M: 510-449-2606 | Twitter: @*amolhkekre*
> 
> www.datatorrent.com
> 
> 
> On Tue, Oct 24, 2017 at 9:12 PM, Tushar Gosavi <
>> tus...@datatorrent.com>
> wrote:
> 
> +1 on creating a new malhar release.
>> 
>> - Tushar.
>> 
>> 
>> On Wed, Oct 25, 2017 at 4:39 AM, Pramod Immaneni <
>>> pra...@datatorrent.com
>>> 
>> wrote:
>> 
>> +1 on creating a new release. I, unfortunately, do not have the time
>>> currently to participate in the release activities.
>>> 
>>> On Mon, Oct 23, 2017 at 7:15 PM, Thomas Weise 
>> wrote:
>>> 
>>> The last release was back in March, there are quite a few JIRAs that
 
>>> have
>> 
>>> been completed since and should be released.
 
 https://issues.apache.org/jira/issues/?jql=fixVersion%
 20%3D%203.8.0%20AND%20project%20%3D%20APEXMALHAR%20ORDER%
 20BY%20status%20ASC
 
 From looking at the list there is nothing that should stand in the
>>> way
 
>>> of a
>>> 
 release?
 
 Also, once the release is out it would be a good opportunity to
>>> effect
 
>>> the
>>> 
 major version change.
 
 Anyone interested to be the release manager?
 
 Thanks,
 Thomas
 
 
 
>>> 
>>> 
>>> --
>>> 
>>> *Chaitanya*
>>> 
>>> Software Engineer
>>> 
>>> E: chaita...@datatorrent.com | Twitter: @chaithu1403
>>> 
>>> www.datatorrent.com  |  apex.apache.org
>>> 
>> 



Re: [DISCUSS] inactive PR

2017-09-23 Thread Ananth G
I would vote for dead PRs to ideally be closed. 

However, I was wondering if we are being too stringent on the timelines. The 
reason I raise this is that on some of my previous pull requests I was told 
the committer would merge after waiting for a few days. Since the definition 
of "few" is not fixed, may I request that we define some timelines for the 
commit time windows as well, so that we have a sufficient gap between these 
two time windows ? 


Regards,
Ananth 
> On 23 Sep 2017, at 11:48 am, Vlad Rozov  wrote:
> 
> I'd suggest that a PR that lack an activity for more than a month is closed. 
> Any objections?
> 
> Thank you,
> 
> Vlad



Re: Partitioning information and current operators Id in setup()/activate() method.

2017-09-22 Thread Ananth G
Thanks a lot Thomas. It is indeed working as you mentioned. It was a bug in my 
code, in an edge case where a single Apex operator is configured. Due to the 
buggy code in the partition method, when the number of Apex operators was 
configured to be one I was returning the prototype operator ( the original 
instance ) without setting some operator-specific internal flags. I misread 
this as the partition logic not being triggered in the first place. 

Thanks for confirming.


Regards,
Ananth
> On 22 Sep 2017, at 1:33 am, Thomas Weise <t...@apache.org> wrote:
> 
> Implementing Partitioner should cover this. The partitioner is called when
> constructing the physical plan (at initialization) and then, in the case of
> dynamic partitioning, on demand.
> 
> As part of the partitioning logic you can set properties on the operator
> instances that are specific to the partition. An example for this is the
> Kafka input operator, where each Apex partition will be setup with the
> Kafka topic partition(s) it is responsible for.
> 
> Thanks,
> Thomas
> 
> 
> On Thu, Sep 21, 2017 at 6:36 AM, Ananth G <ananthg.a...@gmail.com> wrote:
> 
>> Hello All,
>> 
>> I was wondering if there is a good way to get the following two values in
>> an operator when the operator is in the activate/setup methods
>> 
>> 1. The total number of operator physical instances that are going to be
>> launched
>> 2. The ordinal position of the current operator in the current set of
>> physical operators
>> 
>> I am looking for this information so that I can implement a logic in the
>> operator wherein each operator takes a part of the responsibility . I can
>> get the above values when the dynamic partitioning interface is implemented
>> and the call back happens to the assign() method. But the issue is that the
>> assign() method does not seem to be invoked at the beginning but only when
>> the dynamic partitioning is being done ( In other words the assign() method
>> is not called until a certain time has passed since the beginning of the
>> launch of the operator. I would like the above values when the operator is
>> starting in the activate()/setup() methods.)
>> 
>> Any advice as to how these values can be obtained ?
>> 
>> Regards,
>> Ananth
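
To make Thomas's suggestion above concrete, here is a hedged sketch of a 
partitioner that stamps each physical instance with its ordinal and the total 
partition count, so both values are available later in setup()/activate(). It 
assumes the Partitioner and DefaultPartition types from apex-core; the 
operator name, the partitionCount property and the fields are hypothetical:

// Hedged sketch (hypothetical names): each physical partition is stamped
// with its ordinal and the total count at partitioning time; the values
// are part of the checkpointed state and visible later in setup().
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;

import com.datatorrent.api.DefaultPartition;
import com.datatorrent.api.Partitioner;
import com.datatorrent.common.util.BaseOperator;

public class OrdinalAwareOperator extends BaseOperator
    implements Partitioner<OrdinalAwareOperator> {

  private int partitionCount = 4; // configured via an operator property
  private int ordinal;            // this instance's position, set below
  private int totalPartitions = 1;

  @Override
  public Collection<Partition<OrdinalAwareOperator>> definePartitions(
      Collection<Partition<OrdinalAwareOperator>> partitions,
      PartitioningContext context) {
    // Ignore the incoming collection in this sketch and build fresh
    // instances, each carrying its partition-specific properties.
    Collection<Partition<OrdinalAwareOperator>> result = new ArrayList<>();
    for (int i = 0; i < partitionCount; i++) {
      OrdinalAwareOperator instance = new OrdinalAwareOperator();
      instance.ordinal = i;
      instance.totalPartitions = partitionCount;
      result.add(new DefaultPartition<>(instance));
    }
    return result;
  }

  @Override
  public void partitioned(Map<Integer, Partition<OrdinalAwareOperator>> partitions) {
    // no additional bookkeeping in this sketch
  }
}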



Partitioning information and current operators Id in setup()/activate() method.

2017-09-21 Thread Ananth G
Hello All,

I was wondering if there is a good way to get the following two values in an 
operator when the operator is in the activate/setup methods

1. The total number of operator physical instances that are going to be launched
2. The ordinal position of the current operator in the current set of physical 
operators

I am looking for this information so that I can implement a logic in the 
operator wherein each operator takes a part of the responsibility . I can get 
the above values when the dynamic partitioning interface is implemented and the 
call back happens to the assign() method. But the issue is that the assign() 
method does not seem to be invoked at the beginning but only when the dynamic 
partitioning is being done ( In other words the assign() method is not called 
until a certain time has passed since the beginning of the launch of the 
operator. I would like the above values when the operator is starting in the 
activate()/setup() methods.)

Any advice as to how these values can be obtained ? 

Regards,
Ananth 

Re: [Design discussion] - Kudu Input operator

2017-08-24 Thread Ananth G
Hello Thomas,

Thanks for the additional comments. Replies inline marked [Ananth]>>>

Regards,
Ananth
> On 22 Aug 2017, at 2:10 pm, Thomas Weise <t...@apache.org> wrote:
> 
> -->
> 
> Thanks,
> Thomas
> 
> 
> On Sat, Aug 19, 2017 at 2:07 PM, Ananth G <ananthg.a...@gmail.com 
> <mailto:ananthg.a...@gmail.com>> wrote:
> 
>> Hello Thomas,
>> 
>> Replies in line marked [Ananth]>>
>> 
>> Apologies for a little bit more longer description as I think the
>> description needs more clarity.
>> 
>> Regards,
>> Ananth
>> 
>>> On 19 Aug 2017, at 11:10 am, Thomas Weise <t...@apache.org> wrote:
>>> 
>>> Hi Ananth,
>>> 
>>> Nice writeup, couple questions/comments inline ->
>>> 
>>> On Tue, Aug 15, 2017 at 2:02 PM, Ananth G <ananthg.a...@gmail.com
>> <mailto:ananthg.a...@gmail.com <mailto:ananthg.a...@gmail.com>>> wrote:
>>> 
>>>> Hello All,
>>>> 
>>>> The implementation for Apex Kudu Input Operator is ready for a pull
>>>> request. Before raising the pull request, I would like to get any inputs
>>>> regarding the design and incorporate any feedback before raising the
>> pull
>>>> request in the next couple of days for the following JIRA.
>>>> 
>>>> https://issues.apache.org/jira/browse/APEXMALHAR-2472 
>>>> 
>>>> The following are the main features that would be supported by the Input
>>>> operator:
>>>> 
>>>> - The input operator would be used to scan all or some rows of a single
>>>> kudu table.
>>>> - Each Kudu row is translated to a POJO for downstream operators.
>>>> - The Input operator would accept an SQL expression ( described in
>> detail
>>>> below) that would be parsed to generate the equivalent scanner code for
>> the
>>>> Kudu Table. This is because Kudu Table API does not support an SQL
>>>> expressions
>>>> - The SQL expression would have additional options that would help in
>>>> Apache Apex design patterns ( Ex: Sending a control tuple message after
>> a
>>>> query is successfully processed )
>>>> - The Input operator works on a continuous basis i.e. it would accept
>> the
>>>> next query once the current query is complete)
>>>> 
>>> 
>>> This means the operator will repeat the query to fetch newly added rows,
>>> similar to what the JDBC poll operator does, correct?
>> [Ananth]>> Yes. All of this design is covered by the abstract
>> implementation. In fact there is a default implementation of the abstract
>> operator that does exactly this. This default implementation operator is
>> called IncrementalStepScanInputOperator. This operator, driven by a
>> properties file, can be used to implement the JDBC poll operator
>> functionality using a timestamp column as the incremental step value.
>> 
>> The design however does not limit us to only this pattern but can
>> accommodate other patterns as well. Here is what I want to add in this
>> context:
>>    - An additional pattern is a “time travel” pattern. Since Kudu
>> is an MVCC engine ( if appropriately configured ), we can use this
>> operator to answer questions like “Can I stream the entire kudu table, or a
>> subset of it, as of 1 AM, 2 AM, 3 AM ... of today, even though the current
>> time is 6 PM?” ( This is enabled by specifying READ_SNAPSHOT_TIME,
>> which is a supported option of the SQL grammar we are enabling for this
>> operator )
>> 
> 
> So this could be used to do event time based processing based on the
> snapshot time (without a timestamp column)?
> 

 [Ananth]>>> Yes. That is correct. 

> 
>>    - Another interesting pattern is when the next query has no
>> correlation with a previous query. Example use cases could be an Apex-cli
>> equivalent, or a possible future use case like Apache Zeppelin
>> integration: a query comes in ad-hoc and the values are streamed from
>> the current incoming expression, i.e. when we want to enable interactive
>> query based streaming.
>> 
>>> 

Re: [VOTE] Major version change for Apex Library (Malhar)

2017-08-22 Thread Ananth G
+1 for option 2 and second vote for option 1

Have we finalized the library name ? Going from Apex-malhar 3.7 to 
Apex-malhar-1.0 would be counter-intuitive. Also it would be great if we had an 
agreed process to move an operator from @Evolving to a stable version, given we 
are trying to address this as well as part of the proposal.

Regards
Ananth

> On 22 Aug 2017, at 11:40 am, Thomas Weise  wrote:
> 
> +1 for option 2 (second vote +1 for option 1)
> 
> 
>> On Mon, Aug 21, 2017 at 6:39 PM, Thomas Weise  wrote:
>> 
>> This is to formalize the major version change for Malhar discussed in [1].
>> 
>> There are two options for major version change. Major version change will
>> rename legacy packages to org.apache.apex sub packages while retaining file
>> history in git. Other cleanup such as removing deprecated code is also
>> expected.
>> 
>> 1. Version 4.0 as major version change from 3.x
>> 
>> 2. Version 1.0 with simultaneous change of Maven artifact IDs
>> 
>> Please refer to the discussion thread [1] for reasoning behind both of the
>> options.
>> 
>> Please vote on both options. Primary vote for your preferred option,
>> secondary for the other. Secondary vote can be used when counting primary
>> vote alone isn't conclusive.
>> 
>> Vote will be open for at least 72 hours.
>> 
>> Thanks,
>> Thomas
>> 
>> [1] https://lists.apache.org/thread.html/bd1db8a2d01e23b0c0ab98a785f6ee
>> 9492a1ac9e52d422568a46e5f3@%3Cdev.apex.apache.org%3E
>> 


Re: [Design discussion] - Kudu Input operator

2017-08-19 Thread Ananth G
Hello Thomas,

Replies in line marked [Ananth]>> 

Apologies for a little bit more longer description as I think the description 
needs more clarity. 

Regards,
Ananth

> On 19 Aug 2017, at 11:10 am, Thomas Weise <t...@apache.org> wrote:
> 
> Hi Ananth,
> 
> Nice writeup, couple questions/comments inline ->
> 
> On Tue, Aug 15, 2017 at 2:02 PM, Ananth G <ananthg.a...@gmail.com 
> <mailto:ananthg.a...@gmail.com>> wrote:
> 
>> Hello All,
>> 
>> The implementation for Apex Kudu Input Operator is ready for a pull
>> request. Before raising the pull request, I would like to get any inputs
>> regarding the design and incorporate any feedback before raising the pull
>> request in the next couple of days for the following JIRA.
>> 
>> https://issues.apache.org/jira/browse/APEXMALHAR-2472 
>> 
>> The following are the main features that would be supported by the Input
>> operator:
>> 
>> - The input operator would be used to scan all or some rows of a single
>> kudu table.
>> - Each Kudu row is translated to a POJO for downstream operators.
>> - The Input operator would accept an SQL expression ( described in detail
>> below) that would be parsed to generate the equivalent scanner code for the
>> Kudu Table. This is because Kudu Table API does not support an SQL
>> expressions
>> - The SQL expression would have additional options that would help in
>> Apache Apex design patterns ( Ex: Sending a control tuple message after a
>> query is successfully processed )
>> - The Input operator works on a continuous basis i.e. it would accept the
>> next query once the current query is complete)
>> 
> 
> This means the operator will repeat the query to fetch newly added rows,
> similar to what the JDBC poll operator does, correct?
[Ananth]>> Yes. All of this design is covered by the abstract implementation. 
In fact there is a default implementation of the abstract operator that does 
exactly this. This default implementation operator is called 
IncrementalStepScanInputOperator. This operator, driven by a properties file, 
can be used to implement the JDBC poll operator functionality using a timestamp 
column as the incremental step value. 

The design however does not limit us to only this pattern but can accommodate 
other patterns as well. Here is what I want to add in this context: 
- An additional pattern is a “time travel” pattern. Since Kudu is an 
MVCC engine ( if appropriately configured ), we can use this operator to 
answer questions like “Can I stream the entire kudu table, or a subset of it, 
as of 1 AM, 2 AM, 3 AM ... of today, even though the current time is 6 PM?” 
( This is enabled by specifying READ_SNAPSHOT_TIME, which is a supported 
option of the SQL grammar we are enabling for this operator )
- Another interesting pattern is when the next query has no correlation with a 
previous query. Example use cases could be an Apex-cli equivalent, or a 
possible future use case like Apache Zeppelin integration: a query comes in 
ad-hoc and the values are streamed from the current incoming expression, i.e. 
when we want to enable interactive query based streaming.

> 
> - The operator will work in a distributed fashion for the input query. This
>> essentially means for a single input query, the scan work is distributed
>> among all of the physical instances of the input operator.
>> - Kudu splits a table into chunks of data regions called Tablets. The
>> tablets are replicated and partitioned  (range and hash partitions are
>> supported ) in Kudu according to the Kudu Table definition. The operator
>> allows partitioning of the Input Operator to be done in 2 ways.
>>- Map many Kudu Tablets to one partition of the Apex Kudu Input
>> operator
>>- One Kudu Tablet maps to one partition of the Apex Kudu Input
>> operator
>> - The partitioning does not change on a per query basis. This is because
>> of the complex use cases that would arise. For example, if the query is
>> touching only a few rows before the next query is accepted, it would result
>> in a lot of churn in terms of operator serialize/deserialize, YARN
>> allocation requests etc. Also supporting per query partition planning leads
>> to possibly very complex implementation and poor resource usage as all
>> physical instances of the operator have to wait for its peers to complete
>> its scan and wait for next checkpoint to get repartitioned.
>> 
> 
> Ag
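
As a small illustration of the round-robin split described in the quoted 
design above, the following self-contained sketch deals one query's scan 
tokens out across N physical operator instances. The generic token type 
stands in for real Kudu scan tokens; nothing here is the operator's actual 
API:

// Hedged sketch: distribute scan tokens round-robin across numPartitions
// physical instances of the input operator.
import java.util.ArrayList;
import java.util.List;

public class RoundRobinScanPlanner {
  public static <T> List<List<T>> assign(List<T> scanTokens, int numPartitions) {
    List<List<T>> plan = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      plan.add(new ArrayList<T>());
    }
    for (int i = 0; i < scanTokens.size(); i++) {
      plan.get(i % numPartitions).add(scanTokens.get(i)); // token i -> partition i mod N
    }
    return plan;
  }
}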

[Design discussion] - Kudu Input operator

2017-08-15 Thread Ananth G
Hello All,

The implementation for Apex Kudu Input Operator is ready for a pull request. 
Before raising the pull request, I would like to get any inputs regarding the 
design and incorporate any feedback before raising the pull request in the next 
couple of days for the following JIRA.

https://issues.apache.org/jira/browse/APEXMALHAR-2472 


The following are the main features that would be supported by the Input 
operator:

- The input operator would be used to scan all or some rows of a single kudu 
table.
- Each Kudu row is translated to a POJO for downstream operators. 
- The Input operator would accept an SQL expression ( described in detail 
below) that would be parsed to generate the equivalent scanner code for the 
Kudu Table. This is because Kudu Table API does not support an SQL expressions 
- The SQL expression would have additional options that would help in Apache 
Apex design patterns ( Ex: Sending a control tuple message after a query is 
successfully processed )
- The Input operator works on a continuous basis, i.e. it would accept the next 
query once the current query is complete
- The operator will work in a distributed fashion for the input query. This 
essentially means for a single input query, the scan work is distributed among 
all of the physical instances of the input operator.
- Kudu splits a table into chunks of data regions called Tablets. The tablets 
are replicated and partitioned  (range and hash partitions are supported ) in 
Kudu according to the Kudu Table definition. The operator allows partitioning 
of the Input Operator to be done in 2 ways. 
- Map many Kudu Tablets to one partition of the Apex Kudu Input operator
- One Kudu Tablet maps to one partition of the Apex Kudu Input operator
- The partitioning does not change on a per query basis. This is because of the 
complex use cases that would arise. For example, if the query is touching only 
a few rows before the next query is accepted, it would result in a lot of churn 
in terms of operator serialize/deserialize, YARN allocation requests etc. Also 
supporting per query partition planning leads to possibly very complex 
implementation and poor resource usage as all physical instances of the 
operator have to wait for its peers to complete its scan and wait for next 
checkpoint to get repartitioned.
- The partitioner splits the work load of a single query in a round robin 
fashion. After a query plan is generated , each scan token range is distributed 
equally among the physical operator instances.
- The operator allows for two modes of scanning for an application ( Cannot be 
changed on a per query basis ) 
- Consistent Order scanner - only one tablet scan thread is active at 
any given instance of time for a given query
- Random Order scanner - Many threads are active to scan Kudu tablets 
in parallel
- As can be seen, Consistent order scanner would be slower but would help in 
better “exactly once” implementations if the correct method is overridden in 
the operator.
- The operator introduces the DisruptorBlockingQueue for a low latency buffer 
management. LMAX disruptor library was considered and based on some other 
discussion threads on other Apache projects, settled on the ConversantMedia 
implementation of the Disruptor Blocking queue. This blocking queue is used 
when the kudu scanner thread wants to send the scanned row into the input 
operators main thread emitTuples() call.
- The operator allows for exactly once semantics if the user specifies the 
logic for reconciling a possible duplicate row in situations when the operator 
is resuming from a checkpoint. This is done by overriding a method that returns 
a boolean ( true to emit the tuple and false to suppress the tuple ) when the 
operator is working in the reconciling window phase. As can be seen, this 
reconciling phase is only active at the max for one window.
- The operator uses the FSWindowManager to manage metadata at the end of every 
window. From resumption at a checkpoint, the operator will still scan the Kudu 
tablets but simply not emit all rows that were already streamed downstream. 
Subsequently when the operator is in the reconciling window, the method 
described above is invoked to allow for duplicates filter. After this 
reconciling window, the operator works in the normal mode of operation.
- The following are the additional configurable aspects of the operator
- Max tuples per window
- Spin policy and the buffer size for the Disruptor Blocking Queue
- Mechanism to provide custom control tuples if required
- Setting the number of Physical operator instances via the API if 
required. 
- Setting the fault Tolerance. If fault tolerant , an alternative 
replica of the Kudu tablet is picked up for scanning if the initial tablet 
fails for whatever reason. However this slows down the scan throughput. Hence 
it is 
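
To make the reconciling-window idea above concrete, here is a hedged, 
self-contained sketch of the duplicate check. The class and method names are 
illustrative only; the design above merely says a user-overridable method 
returns true to emit a tuple and false to suppress it during the ( at most 
one ) reconciling window:

// Hedged sketch: after resuming from a checkpoint the tablets are
// re-scanned, and for at most one window a key check decides whether a
// row was already emitted before the checkpoint.
import java.util.HashSet;
import java.util.Set;

public class ReconcilingWindowFilter {
  private final Set<String> keysEmittedBeforeCheckpoint;
  private boolean reconciling = true;

  public ReconcilingWindowFilter(Set<String> keysFromWindowMetadata) {
    this.keysEmittedBeforeCheckpoint = new HashSet<>(keysFromWindowMetadata);
  }

  /** true => emit the scanned row downstream, false => suppress it. */
  public boolean allow(String rowKey) {
    if (!reconciling) {
      return true; // normal mode of operation
    }
    return !keysEmittedBeforeCheckpoint.contains(rowKey);
  }

  /** Invoked once the single reconciling window ends. */
  public void endReconcilingWindow() {
    reconciling = false;
  }
}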

Re: Difference between setup() and activate()

2017-08-03 Thread Ananth G
Thanks a lot Vlad, Pramod and Sanjay. This clarifies the differences for me.

Regards
Ananth

> On 4 Aug 2017, at 8:37 am, Pramod Immaneni <pra...@datatorrent.com> wrote:
> 
> Yes, activate is called closer to the start of tuple processing as far as Apex
> is concerned. So if you are doing things like writing an input operator
> that does asynchronous processing, where you will start receiving data as
> soon as you open a connection to your external source, it is better to do it
> in activate to reduce latency and buffer build-up.
> 
>> On Thu, Aug 3, 2017 at 3:07 PM, Vlad Rozov <v.rozo...@gmail.com> wrote:
>> 
>> Correct, both setup() and activate() are called when an operator is
>> restored from a checkpoint. When an operator is restored from a checkpoint,
>> it is considered to be a new instance/deployment of the operator with its
>> state reset to the checkpoint. In this case, Apex core gives the operator a
>> chance to initialize transient fields in either setup() or activate().
>> 
>> I am not aware of any use case where the platform will go through an
>> activate/deactivate cycle without setup/teardown, but such a code path may be
>> introduced in the future (for example, it may be used to manage an input
>> operator with a high emit rate). It is better not to make any assumptions on
>> how many times activate/deactivate may be called.
>> 
>> Currently the main difference between setup() and activate() is described
>> in the java doc for ActivationListener:
>> 
>> * An example of where one would consider implementing ActivationListener
>> * is an input operator which wants to consume a high throughput stream.
>> * Since there is typically at least a few hundred milliseconds between the
>> * time the setup method is called and the first window, you would want to
>> * place the code to activate the stream inside activate instead of setup.
>> 
>> 
>> My recommendation is to use setup() to initialize transient fields unless
>> you need to deal with the above case.
>> 
>> Thank you,
>> 
>> Vlad
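
To make the setup()/activate() distinction concrete, here is a minimal sketch 
of a high-throughput input operator that follows the recommendation above. 
SourceStub and its open()/poll()/close() methods are illustrative stand-ins 
for a real external source, not an Apex or library API:

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.Operator.ActivationListener;

public class HighThroughputInputSketch
    implements InputOperator, ActivationListener<OperatorContext>
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  private transient SourceStub source;

  @Override
  public void setup(OperatorContext context)
  {
    // Initialize transient fields here; this runs on every deployment,
    // including a restore from a checkpoint.
  }

  @Override
  public void activate(OperatorContext context)
  {
    // Open the connection here rather than in setup(): data starts flowing
    // as soon as the connection opens, and activate() runs just before the
    // first window, which minimizes latency and buffer build-up.
    source = SourceStub.open();
  }

  @Override
  public void emitTuples()
  {
    // Runs on the operator's main thread between window boundaries.
    String tuple;
    while ((tuple = source.poll()) != null) {
      output.emit(tuple);
    }
  }

  @Override
  public void deactivate()
  {
    source.close();
  }

  @Override
  public void teardown()
  {
  }

  @Override
  public void beginWindow(long windowId)
  {
  }

  @Override
  public void endWindow()
  {
  }

  // Minimal stand-in so the sketch is self-contained; not a real API.
  static class SourceStub
  {
    static SourceStub open() { return new SourceStub(); }
    String poll() { return null; }
    void close() { }
  }
}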

Re: Difference between setup() and activate()

2017-08-02 Thread Ananth G
Hello Vlad,

Thanks for your response. 

>>Do you refer to restoring from a checkpoint as serialize/deserialize cycles?
Yes. 

>>In case of restoring from a checkpoint (deserialization) setup() is a part of 
>>a redeployment request, AFAIK. 
This sounds somewhat contradictory to the response from Sanjay in the mail 
thread below. I tried to quickly glance at the apex-core code, and it looks 
like both are being called (perhaps I am entirely wrong on this, as it was 
only a quick scan). I was referring to the code in StreamingContainer.java in 
the engine package, in the method called deploy().


>>Please see ActivationListener javadoc for details when it is necessary to use 
>>activate() vs setup().
I had to raise this question on the mailing list after going through the 
javadoc. The javadoc is a bit cryptic for this serialise/deserialise scenario. 
It is also not clear what is meant by activate()/deactivate() being called 
multiple times whereas setup() is called once in the lifetime of the operator. 
If setup() is called once in the lifetime of an operator per the javadoc, does 
that mean once in the lifetime of the JVM instantiating it via the 
constructor, or once across the deserialise cycles of the passivated operator 
state? If it is once across all passivated instances of the operator, then 
setup() would not be called multiple times and hence would not be a great 
location to initialise transient variables? If setup() is called across 
deserialise cycles, then I find it more confusing as to why we need both 
setup() and activate() methods with almost the same functionality.

Thoughts?


Regards,
Ananth 


> On 1 Aug 2017, at 3:38 am, Vlad Rozov <v.ro...@datatorrent.com> wrote:
> 
> Do you refer to restoring from a checkpoint as serialize/deserialize cycles? 
> There are no calls to setup/teardown and/or activate/deactivate during 
> checkpointing/serialization. In case of restoring from a checkpoint 
> (deserialization) setup() is a part of a redeployment request, AFAIK. The 
> best answer to question 3 is: it depends. In most cases, using setup() to 
> resolve all transient fields is as good as doing that in activate(). Please 
> see ActivationListener javadoc for details when it is necessary to use 
> activate() vs setup().
> 
> Thank you,
> 
> Vlad
> 
> On 7/29/17 19:58, Sanjay Pujare wrote:
>> The Javadoc comment
>> for com.datatorrent.api.Operator.ActivationListener  (in
>> https://github.com/apache/apex-core/blob/master/api/src/main/java/com/datatorrent/api/Operator.java)
>> should hopefully answer your questions.
>> 
>> Specifically:
>> 
>> 1. No, setup() is called only once in the entire lifetime (
>> http://apex.apache.org/docs/apex/operator_development/#setup-call)
>> 
>> 2. Yes. When an operator is "activated" - first time in its life or
>> reactivation after a failover - activate() is called before the first
>> beginWindow() is called.
>> 
>> 3. Yes.
>> 



Difference between setup() and activate()

2017-07-29 Thread Ananth G
Hello All,

I was looking at the documentation and could not get a clear distinction of 
behaviours for setup() and activate() in scenarios where an operator is 
passivated (e.g. application shutdown, repartition use cases) and then brought 
back to life again. Could someone from the community advise me on the 
following questions?

1. Is setup() called in these scenarios (serialize/deserialize cycles) as 
well?

2. I am assuming activate() is called in these scenarios? The javadoc for 
activation states that activate() can be called multiple times (without 
explicitly stating why), and my assumption is that this is because of these 
scenarios.

3. If setup() is only called once during the lifetime of an operator, is it 
fair to assume that activate() is the best place to resolve all of the 
transient fields of an operator?


Regards,
Ananth 

Re: Image processing library

2017-05-12 Thread Ananth G
The use cases as documented look really compelling. There may be more comments 
from a code-review perspective; the points below are from a use-case 
perspective only.

I was wondering if you have any latency measurements for the tests you ran. 

If the image processing calls (in the process function overridden from the 
Toolkit class) are time consuming, it might not be an ideal use case for a 
streaming engine? A very old blog post (from 2012) talks about latencies 
anywhere between tens of milliseconds and almost a second, depending on the 
use case and image size. Of course, there have been hardware improvements 
since then and those numbers might no longer hold, hence the question (the 
latencies of course also depend on the hardware being used).

This brings me to a more general question about Apex for the community: what 
is considered an acceptable tolerance level in terms of latencies for a 
streaming compute engine like Apex? Is there a way to tune the acceptable 
tolerance level depending on the use case? I keep reading in the mailing lists 
that tuple processing is part of the main thread and hence should be as fast 
as possible.

Regards
Ananth

> On 12 May 2017, at 9:05 pm, Aditya gholba  wrote:
> 
> Hello,
> I have been working on an image processing library for Malhar and few of
> the operators are ready. I would like to merge them in Malhar contrib. You
> can read about the operators and the applications I have created so far
> here.
> 
> 
> Link to my GitHub 
> 
> All suggestions and opinions are welcome.
> 
> 
> Thanks,
> Aditya.


Re: One Year Anniversary of Apache Apex

2017-04-26 Thread Ananth G
Congratulations all..

I was wondering if anyone could point me to docs/examples of what Thomas
meant in his blog by stating that "Iterative processing is now supported by
the engine to process loop based patterns for ML". Does this mean a tuple
can flow multiple times through the same operator? If yes, that is a very
interesting feature (perhaps unique to Apex as compared to the Flink and
Spark engines) and could be a building block for more patterns on top of the
Apex engine, hence the question.

Regards,
Ananth

On Wed, Apr 26, 2017 at 2:23 PM, Shubham Pathak wrote:

> Nice blog Thomas!!
> Congratulations to the community!!
>
> Thanks and Regards,
> Shubham Pathak
>
>
> On Wed, Apr 26, 2017 at 9:11 AM, Bhupesh Chawda wrote:
>
> > Congratulations to the community!!
> >
> > ~ Bhupesh
> >
> > On Tue, Apr 25, 2017 at 8:44 PM, Thomas Weise  wrote:
> >
> > > Hi,
> > >
> > > It's been one year for Apache Apex as top level project,
> congratulations
> > to
> > > the community!
> > >
> > > I wrote this blog to reflect and look ahead:
> > >
> > > http://www.atrato.io/blog/2017/04/25/one-year-apex/
> > >
> > > Your comments and suggestions are welcome.
> > >
> > > Thanks,
> > > Thomas
> > >
> >
>


Checkstyle policies for auto generated code

2017-04-24 Thread Ananth G
Hello All,


I have run into a dilemma and would like to know what our policy is to deal 
with the following situation. 

As part of the implementation for the Kudu input operator 
(https://issues.apache.org/jira/browse/APEXMALHAR-2472), I will be using 
Antlr4 as the parser tool to parse a line of string as an SQL-equivalent 
statement representing the set of tuples that will be streamed out of the 
Kudu store to the downstream operators. 

I will post the design in a separate mailing thread once a few things are 
finalised; this mail is more about checkstyle and auto-generated code from 
tools like Antlr4. 

The design involves writing a grammar file and letting the maven plugin 
generate the parser and related code as .java files as part of the build 
process. We would only keep maintaining the grammar “.g4” file in the 
repository check-ins as the Kudu functionality evolves. However, this brings 
me to the situation wherein checkstyle fails for the classes that are 
autogenerated. Following are the three options that I think we have; I would 
like to get thoughts on the best way forward.

Option 1: We let the autogenerated code be generated under the 
“target/generated-sources” path. This is the default for the maven antlr 
plugin. However, this does not pass the checkstyle maven plugin, because the 
checkstyle plugin checks auto-generated code as well. The fix is to configure 
the checkstyle plugin to only look at the “src/” folder paths as opposed to 
compiled sources (see the configuration sketch after Option 2). This works 
from a build perspective, but the drawback is that IDEs will not include 
“target/generated-sources” for class resolution. IDEs do have plugins to 
resolve this, but that might be considered irksome by the developer community. 

Option 2: We let the Antlr4 code-gen generate code in the Kudu package path; 
of course, checkstyle would fail this as well. The fix is to give checkstyle 
an “excludes” pattern so that it ignores all java files matching the pattern 
of files generated by the Antlr4 code-gen tool. There is still an issue to be 
resolved even if this approach is approved by the community: the tool 
generates a couple of “.tokens” files that are always placed in the root class 
path rather than under the package structure, which pollutes the sanity a bit. 
I am still working on this, as it needs to be resolved. 
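
For reference, a rough sketch of how either option might look in the 
maven-checkstyle-plugin configuration (the exclude pattern below is a 
placeholder, not the actual package path of the generated parser):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-checkstyle-plugin</artifactId>
  <configuration>
    <!-- Option 1: check only hand-written sources so that anything under
         target/generated-sources is never inspected. -->
    <sourceDirectories>
      <sourceDirectory>${basedir}/src/main/java</sourceDirectory>
    </sourceDirectories>
    <!-- Option 2: alternatively, exclude generated files by pattern. -->
    <excludes>**/kudu/sqlparser/**</excludes>
  </configuration>
</plugin>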

Option 3: Perhaps the ideal solution is a separate kudu module at the top 
level, which would resolve all of these issues (i.e. token files are generated 
in the kudu module root along with the java sources in the correct package 
structure); I guess that is a separate discussion that Thomas/Vlad and others 
are planning to take up in the mailing list. 

Could you please let me know what you think is the ideal path to pursue (or if 
there are other alternatives for the use case above)? 

Regards,
Ananth 

Re: Ab initio vs Apex

2016-12-09 Thread Ananth G
Sorry for the spam. This mail was meant for the user mailing list, not dev; 
I have reposted it there.

Regards
Ananth



Ab initio vs Apex

2016-12-09 Thread Ananth G
Hello all,

I was wondering if anyone has done any tests comparing Apex with Ab Initio. 

I guess Ab Initio is closed source, and there is not much I could gather from 
the internet. 

Regards
Ananth

Re: Apex malhar Guava dependency

2016-12-01 Thread Ananth G
Hello Bright,

Did the maven approach of excluding the guava dependency when specifying 
malhar not work? 

By this I mean using the following exclusion stub when adding the malhar 
dependency in your pom.xml:

<exclusions>
  <exclusion>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
  </exclusion>
</exclusions>



Regards
Ananth
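
To verify which Guava version actually wins on the runtime classpath, the 
standard Maven dependency tree can also be inspected, filtered down to Guava:

mvn dependency:tree -Dincludes=com.google.guava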

> On 2 Dec. 2016, at 1:02 pm, Munagala Ramanath  wrote:
> 
> Bright,
> 
> We have this in contrib/pom.xml which I think is causing the problem:
> 
> <dependency>
>   <groupId>com.google.guava</groupId>
>   <artifactId>guava</artifactId>
>   <version>16.0.1</version>
>   <scope>provided</scope>
>   <optional>true</optional>
> </dependency>
> 
> whereas hadoop-common 2.2.0 depends on Guava 11.0.2
> 
> Ram
> 
>> On Thu, Dec 1, 2016 at 4:11 PM, Bright Chen  wrote:
>> 
>> Hi All,
>> I got a runtime exception (in a unit test in local mode) when trying to test
>> the Guava BloomFilter in malhar; the exception is as follows. It seems
>> malhar depended on Guava 11.0.2 at compile time while depending on a higher
>> version at runtime, which caused this compatibility issue.
>> 
>> Has anyone got the same issue, or any idea how to solve or get around it?
>> 
>> java.lang.AbstractMethodError:
>> org.apache.apex.malhar.lib.state.managed.SliceFunnel.funnel(Ljava/lang/Object;Lcom/google/common/hash/PrimitiveSink;)V
>> 
>> at com.google.common.hash.AbstractStreamingHashFunction$AbstractStreamingHasher.putObject(AbstractStreamingHashFunction.java:223)
>> 
>> at com.google.common.hash.AbstractStreamingHashFunction.hashObject(AbstractStreamingHashFunction.java:37)
>> 
>> thanks
>> Bright
>> 


Re: [jira] [Comment Edited] (APEXCORE-576) Apex/Malhar Extensions

2016-11-28 Thread Ananth G
If we were to model something along the lines of "Spark packages", I am
afraid that the story of Apex being a "rich ecosystem" might be diluted.

Today I see Malhar as a strong reason to convince someone that we can
interconnect many things as part of the offering. The moment we put a
layered process in place, my concern is that the review process would become
very loose, and hence lower quality would creep in, as Apex committers need
not necessarily review such packages and hence provide inputs to the
evolving Apex core etc. There is a risk that things might diverge more than
we want them to.

Thoughts?


Regards,
Ananth

On Tue, Nov 29, 2016 at 6:54 AM, Sanjay M Pujare (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/APEXCORE-576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702868#comment-15702868 ]
>
> Sanjay M Pujare edited comment on APEXCORE-576 at 11/28/16 7:54 PM:
> 
>
> As far as I understand, the main enhancement here is the "registry" and a
> process to pick items from the registry to merge into the main repository,
> and we will be using the current git process of pull-requests and merging
> across forks. Are there existing models we can look at for the registry?
>
> A few things that need to be hammered out:
> - registry access control, and format
> - criteria for merging external contributions into the main repo (quality,
> unit tests, demand, licensing)
> - level of automation available, or otherwise the manual process used to
> merge contributions
> - dependency matrix management (e.g. an item depends on malhar 3.5.0
> minimum)
> - guidelines for adding items to the registry to be followed by
> contributors
>
>
> > Apex/Malhar Extensions
> > --
> >
> > Key: APEXCORE-576
> > URL: https://issues.apache.org/jira/browse/APEXCORE-576
> > Project: Apache Apex Core
> >  Issue Type: New Feature
> >  Components: Website
> >Reporter: Chinmay Kolhatkar
> >
> > The purpose of this task is to provide a way to external contributors to
> make better contributions to Apache Apex project.
> > The idea looks something like this:
> > 1. One could have an extension to apex core/malhar in their own repository
> > and just register it with Apache Apex.
> > 2. As it matures and finds more and more use, we can consume it in
> > mainstream releases.
> > 3. Some possibilities of Apex extensions are as follows:
> > a. Operators - DataSources, DataDestinations, Parsers, Formatters,
> > Processing etc.
> > b. Apex Engine Pluggables
> > c. External Integrations
> > d. Integration with other platforms like Machine learning, Graph
> > engines etc.
> > e. Applications which are ready to use.
> > f. Apex-related tools which can ease the development and usage of
> > apex.
> > The initial discussion about this has happened here:
> > https://lists.apache.org/thread.html/3d6ca2b46c53df77f37f54d64e1860
> 7a623c5a54f439e1afebfbef35@%3Cdev.apex.apache.org%3E
> > More concrete discussion/implementation proposal required for this task
> to progress.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>