Re: APEXMALHAR-2400 In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread Chinmay Kolhatkar
I agree with the requirement. But I think the requirement here is superseded
by the requirement at
https://lists.apache.org/thread.html/d1e282047399558a4f04aab186c0d986114d353dcaf6d6c76a2a3a35@%3Cdev.apex.apache.org%3E

Having PojoInnerJoin to emit a POJO instead of map overrides this.

I suggest we do that first and, if there is still scope, look into this.

-Chinmay.


On Thu, Feb 16, 2017 at 12:27 AM, Shunxin Lu  wrote:

> Hi Hitesh,
> You are absolutely right. The PojoInnerJoin accumulation we have now is
> only there to test the implementation of WindowedMergeOperator. I did not
> consider the case you mentioned while developing, so please make changes to
> fix this.
> Thanks,
> Shunxin
>
> On Wed, Feb 15, 2017 at 4:25 AM, Hitesh Kapoor 
> wrote:
>
> > Hi All,
> >
> > In the PojoInnerJoin accumulation, fields with the same name are emitted
> > as a single field even if we don't join on them. For example, consider the
> > following 2 POJOs on 2 streams:
> >
> > POJO1
> > {
> >id: Int
> >age : String
> > }
> >
> > POJO2
> > {
> >  id: Int
> >  age : String
> >  name : String
> > }
> >
> > If we join only on the field 'id', the resulting stream contains the
> > commonly named field (age) only from POJO2. So I am unsure whether the
> > resulting stream should contain the field 'age' from only POJO1 (or only
> > POJO2), or whether it should contain the field 'age' from both POJOs.
> >
> > I think this is a bug that should be fixed: the resulting stream should
> > contain the commonly named field from both POJOs (and maybe rename it in
> > the final output). Let me know your thoughts.
> >
> > Regards,
> > Hitesh
> >
>


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

2017-02-15 Thread Thomas Weise
I don't think this should be designed based on a simplistic file
input-output scenario. It would be good to include a stateful
transformation based on event time.

More complex pipelines contain stateful transformations that depend on
windowing and watermarks. I think we need a watermark concept that is based
on progress in event time (or another monotonically increasing sequence) that
other operators can generically work with.

Note that even file input in many cases can produce time based watermarks,
for example when you read part files that are bound by event time.

Thanks,
Thomas
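
A minimal sketch of the event-time watermark concept discussed above,
assuming a plain payload class (all names are illustrative, not part of the
Apex API):

    // Control tuple carrying progress in event time (or another
    // monotonically increasing sequence) that stateful operators can
    // generically work with.
    public class EventTimeWatermark
    {
      private final long eventTime;  // e.g., milliseconds since epoch

      public EventTimeWatermark(long eventTime)
      {
        this.eventTime = eventTime;
      }

      public long getEventTime()
      {
        return eventTime;
      }

      // A windowed operator can treat an event-time window as complete once
      // the watermark passes the window's end time.
      public boolean isWindowComplete(long windowEndTime)
      {
        return eventTime >= windowEndTime;
      }
    }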


On Wed, Feb 15, 2017 at 4:02 AM, Bhupesh Chawda 
wrote:

> To better understand the use case for control tuples in batch, I am
> creating a prototype for a batch application using File Input and File
> Output operators.
>
> To enable basic batch processing for File IO operators, I am proposing the
> following changes to File input and output operators:
> 1. File Input operator emits a watermark each time it opens and closes a
> file. These can be "start file" and "end file" watermarks which include the
> corresponding file names. The "start file" tuple should be sent before any
> of the data from that file flows.
> 2. File Input operator can be configured to end the application after a
> single or n scans of the directory (a batch). This is where the operator
> emits the final watermark (the end-of-application control tuple), which will
> also shut down the application.
> 3. The File output operator handles these control tuples. "Start file"
> initializes the file name for the incoming tuples. "End file" watermark
> forces a finalize on that file.
>
> The user would be able to enable the operators to send only those
> watermarks that are needed in the application. If none of the options are
> configured, the operators behave as in a streaming application.
>
> There are a few challenges in the implementation where the input operator
> is partitioned. In this case, the correlation between the start/end for a
> file and the data tuples for that file is lost. Hence we need to maintain
> the filename as part of each tuple in the pipeline.
>
> The "start file" and "end file" control tuples in this example are
> temporary names for watermarks. We can have generic "start batch" / "end
> batch" tuples which could be used for other use cases as well. The Final
> watermark is common and serves the same purpose in each case.
>
> Please let me know your thoughts on this.
>
> ~ Bhupesh
>
>
>
> On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda 
> wrote:
>
> > Yes, this can be part of operator configuration. Given this, for a user,
> > defining a batch application would mean configuring the connectors (mostly
> > the input operator) in the application for the desired behavior.
> > Similarly, use cases other than batch can be achieved.
> >
> > We may also need to take care of the following:
> > 1. Make sure that the watermarks or control tuples are consistent across
> > sources. Meaning an HDFS sink should be able to interpret the watermark
> > tuple sent out by, say, a JDBC source.
> > 2. In addition to I/O connectors, we should also look at the need for
> > processing operators to understand some of the control tuples /
> > watermarks. For example, we may want to reset the operator behavior on
> > arrival of some watermark tuple.
> >
> > ~ Bhupesh
> >
> > On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise  wrote:
> >
> >> The HDFS source can operate in two modes, bounded or unbounded. If you
> >> scan
> >> only once, then it should emit the final watermark after it is done.
> >> Otherwise it would emit watermarks based on a policy (file names, etc.).
> >> The mechanism to generate the marks may depend on the type of source and
> >> the user needs to be able to influence/configure it.
> >>
> >> Thomas
> >>
> >>
> >> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <
> bhup...@datatorrent.com>
> >> wrote:
> >>
> >> > Hi Thomas,
> >> >
> >> > I am not sure I completely understand your suggestion. Are you
> >> > suggesting broadening the scope of the proposal to treat all sources as
> >> > bounded as well as unbounded?
> >> >
> >> > In the case of Apex, we treat all sources as unbounded. Even bounded
> >> > sources like the HDFS file source are treated as unbounded by means of
> >> > scanning the input directory repeatedly.
> >> >
> >> > Let's consider the HDFS file source, for example:
> >> > If we treat it as a bounded source, we can define hooks that allow us to
> >> > detect the end of the file and send the "final watermark". We could also
> >> > consider the HDFS file source as a streaming source and define hooks
> >> > that send watermarks based on different kinds of windows.
> >> >
> >> > Please correct me if I misunderstand.
> >> >
> >> > ~ Bhupesh
> >> >
> >> >
> >> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas 

Regarding Licenses

2017-02-15 Thread Ameya Gokhale
Hi there.

I am an intern at DataTorrent and I want to use the repositories below in my
project. Could you tell me whether I can use them, given that I would be
using them on top of Apache Apex?

JavaCV: https://github.com/bytedeco/javacv

The license says:

You may use this work under the terms of either the Apache License,
Version 2.0, or the GNU General Public License (GPL), either version 2,
or any later version, with "Classpath" exception (details below).

You don't have to do anything special to choose one license or the other
and you don't have to notify anyone which license you are using. You are
free to use this work in any project (even commercial projects) as long
as the copyright header is left intact.

JavaANPR: https://github.com/oskopek/javaanpr

The license says:

The Educational Community License version 2.0 ("ECL") consists of the
Apache 2.0 license, modified to change the scope of the patent grant in
section 3 to be specific to the needs of the education communities using
this license. The original Apache 2.0 license can be found at:
http://www.apache.org/licenses/LICENSE-2.0

Just wanted to know whether I can use them for my DataTorrent project or not.

Also, kindly subscribe me to the mailing list so that I can stay updated.

Thanks,
Ameya Vijay Gokhale


Re: PojoOuterJoin (Left, Right and Full) Accumulation

2017-02-15 Thread Shunxin Lu
Hi Hitesh,
This is a great idea IMO. As I mentioned in another thread, the
PojoInnerJoin and other MergeAccumulations are only to test the
implementation of WindowedMergeOperator. I did not consider
many use cases when writing those accumulations, so it will be
great if we can handle more types of joins.
I personally think the name of the base class you proposed should
be PojoJoin because we are only handling streams of Pojos in these
accumulations.
Thanks,
Shunxin

On Mon, Feb 13, 2017 at 9:50 PM, Hitesh Kapoor 
wrote:

> Hi All,
>
> I am proposing PojoOuterJoin accumulation.
> We already have PojoInnerJoin accumulation.
>
> As the name suggests, this accumulation will be responsible for computing an
> outer join of Pojo streams on some input keys. The input and output stream
> types for this accumulation will be the same as those of PojoInnerJoin.
>
> I am planning to create a base class named Join; PojoInnerJoin,
> PojoLeftOuterJoin, PojoRightOuterJoin, and PojoFullOuterJoin can extend the
> functionality of Join to implement the required joins, because there will be
> a lot of common code in these accumulations.
>
> Let me know your thoughts on it.
>
> Regards,
> Hitesh
>
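
A minimal sketch of the proposed hierarchy, assuming the base class factors
out the join keys and each subclass only decides which unmatched rows to emit
(class and method names are illustrative, not the final Malhar API):

    public abstract class Join
    {
      protected final String[] keys;  // field names to join on

      protected Join(String... keys)
      {
        this.keys = keys;
      }

      // Subclasses decide whether rows without a match on the other stream
      // are still emitted (with nulls for the missing side).
      protected abstract boolean emitUnmatchedLeft();

      protected abstract boolean emitUnmatchedRight();
    }

    class PojoLeftOuterJoin extends Join
    {
      PojoLeftOuterJoin(String... keys)
      {
        super(keys);
      }

      @Override
      protected boolean emitUnmatchedLeft()
      {
        return true;   // keep unmatched rows from the left stream
      }

      @Override
      protected boolean emitUnmatchedRight()
      {
        return false;  // drop unmatched rows from the right stream
      }
    }

PojoRightOuterJoin would flip the two flags, PojoFullOuterJoin would return
true for both, and PojoInnerJoin false for both.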


Re: APEXMALHAR-2400 In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread Shunxin Lu
Hi Hitesh,
You are absolutely right. The PojoInnerJoin accumulation we have now is
only there to test the implementation of WindowedMergeOperator. I did not
consider the case you mentioned while developing, so please make changes to
fix this.
Thanks,
Shunxin

On Wed, Feb 15, 2017 at 4:25 AM, Hitesh Kapoor 
wrote:

> Hi All,
>
> In the PojoInnerJoin accumulation, fields with the same name are emitted as
> a single field even if we don't join on them. For example, consider the
> following 2 POJOs on 2 streams:
>
> POJO1
> {
>id: Int
>age : String
> }
>
> POJO2
> {
>  id: Int
>  age : String
>  name : String
> }
>
> If we join only on the field 'id', the resulting stream contains the
> commonly named field (age) only from POJO2. So I am unsure whether the
> resulting stream should contain the field 'age' from only POJO1 (or only
> POJO2), or whether it should contain the field 'age' from both POJOs.
>
> I think this is a bug that should be fixed: the resulting stream should
> contain the commonly named field from both POJOs (and maybe rename it in the
> final output). Let me know your thoughts.
>
> Regards,
> Hitesh
>


Re: APEXMALHAR-2400 In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread Sanjay Pujare
Yes, IMO it should be fixed.

On 2/15/17, 6:56 AM, "AJAY GUPTA"  wrote:

This is a bug if we consider a normal database join. All fields from both
POJOs should be emitted in the result irrespective of the name.

Ajay

On Wed, 15 Feb 2017 at 5:55 PM, Hitesh Kapoor 
wrote:

> Hi All,
>
> In the PojoInnerJoin accumulation, fields with the same name are emitted as
> a single field even if we don't join on them. For example, consider the
> following 2 POJOs on 2 streams:
>
> POJO1
> {
>id: Int
>age : String
> }
>
> POJO2
> {
>  id: Int
>  age : String
>  name : String
> }
>
> If we join only on the field 'id', the resulting stream contains the
> commonly named field (age) only from POJO2. So I am unsure whether the
> resulting stream should contain the field 'age' from only POJO1 (or only
> POJO2), or whether it should contain the field 'age' from both POJOs.
>
> I think this is a bug that should be fixed: the resulting stream should
> contain the commonly named field from both POJOs (and maybe rename it in the
> final output). Let me know your thoughts.
>
> Regards,
> Hitesh
>





[GitHub] apex-malhar pull request #553: upgrade httpclient version

2017-02-15 Thread mattqzhang
GitHub user mattqzhang opened a pull request:

https://github.com/apache/apex-malhar/pull/553

upgrade httpclient version



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mattqzhang/apex-malhar httpclientcve

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/553.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #553


commit d2f779a17230c639e86531ed40815036438379f8
Author: Matt Zhang 
Date:   2017-02-15T18:09:28Z

upgrade httpclient version






[GitHub] apex-core pull request #473: upgrade httpclient version

2017-02-15 Thread mattqzhang
GitHub user mattqzhang opened a pull request:

https://github.com/apache/apex-core/pull/473

upgrade httpclient version



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mattqzhang/apex-core httpclientcve

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/473.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #473


commit 73ca9ab5b7c5b5d752c7e7ff8f7577e56ef85878
Author: Matt Zhang 
Date:   2017-02-15T18:11:18Z

upgrade httpclient version






[GitHub] apex-core pull request #466: APEXCORE-634 Apex Platform unable to set unifie...

2017-02-15 Thread deepak-narkhede
GitHub user deepak-narkhede reopened a pull request:

https://github.com/apache/apex-core/pull/466

APEXCORE-634 Apex Platform unable to set unifier attributes for modules

Problem: Unable to set unifier attributes of an output port within modules.
Description: When modules are flattened in the logical plan of the DAG, only
top-level attributes of OperatorMeta are cloned; unifier attributes are not
copied in the PortMapping for output ports.
Solution: Clone the unifier attributes while flattening the DAG in the
logical plan.

Testing was done for a custom application with and without modules, and also
with the HDFS-to-S3 app template module. Debug logs and stack traces were
used to verify the fix.

In the log snippets below, the unifier attribute TIMEOUT_WINDOW_COUNT was set
to 10 (i.e., 1000 ms):

**Without fix:**

2017-02-06 15:51:31,539 WARN 
com.datatorrent.stram.StreamingContainerManager: UNIFIER operator 
PTOperator[id=4,name=**genmodule$gen.out#unifier]** committed window 
, recovery window , current time 1486376491539, 
last window id change time 0, **window processing timeout millis 6**


**With Fix:**

2017-02-06 14:22:49,602 WARN 
com.datatorrent.stram.StreamingContainerManager: UNIFIER operator 
PTOperator[id=4,name=**genmodule$gen.out#unifier]** committed window 
, recovery window , current time 1486371169602, 
last window id change time 0, **window processing timeout millis 1000**
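
For reference, a minimal sketch of how such an attribute is set on a module's
output port, assuming the standard DAG API (the helper and port parameter are
illustrative):

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.Operator;

    public class UnifierAttributeSketch
    {
      // 'output' would be the output port whose unifier we configure.
      public static void configure(DAG dag, Operator.OutputPort<?> output)
      {
        // 10 streaming windows; in the logs above this corresponds to the
        // 1000 ms unifier processing timeout once the fix clones the
        // attribute during module flattening.
        dag.setUnifierAttribute(output, OperatorContext.TIMEOUT_WINDOW_COUNT, 10);
      }
    }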



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/deepak-narkhede/apex-core APEXCORE-634

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/466.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #466


commit 12ce7c5d8082289c4690d70b09e65033090c44e1
Author: deepak-narkhede 
Date:   2017-02-07T11:09:07Z

APEXCORE-634 Apex Platform unable to set unifier attributes for modules in 
DAG.

Problem: Unable to set unifier attributes of an output port within modules.
Description: When modules are flattened in the logical plan of the DAG, only
top-level attributes of OperatorMeta are cloned; unifier attributes are not
copied in the PortMapping for output ports.
Solution: Clone the unifier attributes while flattening the DAG in the
logical plan.

commit 5df34a81670520cedc74b1ac3f14d1a2d02ca203
Author: deepak-narkhede 
Date:   2017-02-15T16:09:23Z

APEXCORE-634 Added unit test for testing unifier attribute for module.






[jira] [Commented] (APEXCORE-634) Apex Platform unable to set unifier attributes for modules in DAG

2017-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868202#comment-15868202
 ] 

ASF GitHub Bot commented on APEXCORE-634:
-

GitHub user deepak-narkhede reopened a pull request:

https://github.com/apache/apex-core/pull/466

APEXCORE-634 Apex Platform unable to set unifier attributes for modules

Problem: Unable to set unifier attributes of an output port within modules.
Description: When modules are flattened in the logical plan of the DAG, only
top-level attributes of OperatorMeta are cloned; unifier attributes are not
copied in the PortMapping for output ports.
Solution: Clone the unifier attributes while flattening the DAG in the
logical plan.

Testing was done for a custom application with and without modules, and also
with the HDFS-to-S3 app template module. Debug logs and stack traces were
used to verify the fix.

In the log snippets below, the unifier attribute TIMEOUT_WINDOW_COUNT was set
to 10 (i.e., 1000 ms):

**Without fix:**

2017-02-06 15:51:31,539 WARN 
com.datatorrent.stram.StreamingContainerManager: UNIFIER operator 
PTOperator[id=4,name=**genmodule$gen.out#unifier]** committed window 
, recovery window , current time 1486376491539, 
last window id change time 0, **window processing timeout millis 6**


**With Fix:**

2017-02-06 14:22:49,602 WARN 
com.datatorrent.stram.StreamingContainerManager: UNIFIER operator 
PTOperator[id=4,name=**genmodule$gen.out#unifier]** committed window 
, recovery window , current time 1486371169602, 
last window id change time 0, **window processing timeout millis 1000**



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/deepak-narkhede/apex-core APEXCORE-634

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/466.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #466


commit 12ce7c5d8082289c4690d70b09e65033090c44e1
Author: deepak-narkhede 
Date:   2017-02-07T11:09:07Z

APEXCORE-634 Apex Platform unable to set unifier attributes for modules in 
DAG.

Problem: Unable to set unifier attributes of an output port within modules.
Description: When modules are flattened in the logical plan of the DAG, only
top-level attributes of OperatorMeta are cloned; unifier attributes are not
copied in the PortMapping for output ports.
Solution: Clone the unifier attributes while flattening the DAG in the
logical plan.

commit 5df34a81670520cedc74b1ac3f14d1a2d02ca203
Author: deepak-narkhede 
Date:   2017-02-15T16:09:23Z

APEXCORE-634 Added unit test for testing unifier attribute for module.




> Apex Platform unable to set unifier attributes for modules in DAG
> -
>
> Key: APEXCORE-634
> URL: https://issues.apache.org/jira/browse/APEXCORE-634
> Project: Apache Apex Core
>  Issue Type: Bug
>Reporter: Deepak Narkhede
>Assignee: Deepak Narkhede
>






Re: PojoInnerJoin Accumulation emitting Map

2017-02-15 Thread Chinmay Kolhatkar
Thanks, guys. I'll create a JIRA and a PR for it soon.

Ajay, by the way, this is not really related to the schema discovery design;
this is for the windowed operator.

- Chinmay.

On 15 Feb 2017 10:11 p.m., "Amol Kekre"  wrote:

> yes it should be POJO
>
> Thks
> Amol
>
>
> On Wed, Feb 15, 2017 at 7:34 AM, AJAY GUPTA  wrote:
>
> > Yes, it should be emitting a POJO. This POJO can then be further used to
> > join with a third POJO stream, thus behaving like a DB join.
> >
> > It would be best if we can incorporate this into Schema Discovery design.
> >
> >
> > Ajay
> >
> > On Wed, 15 Feb 2017 at 5:30 PM, Chinmay Kolhatkar 
> > wrote:
> >
> > Dear Community,
> >
> > Currently the PojoInnerJoin accumulation accepts 2 POJOs but emits a Map.
> >
> > I think it should emit a POJO instead of a Map.
> >
> > Please share your thoughts about this.
> >
> > Thanks,
> > Chinmay.
> >
>


[jira] [Commented] (APEXCORE-634) Apex Platform unable to set unifier attributes for modules in DAG

2017-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868166#comment-15868166
 ] 

ASF GitHub Bot commented on APEXCORE-634:
-

Github user deepak-narkhede closed the pull request at:

https://github.com/apache/apex-core/pull/466


> Apex Platform unable to set unifier attributes for modules in DAG
> -
>
> Key: APEXCORE-634
> URL: https://issues.apache.org/jira/browse/APEXCORE-634
> Project: Apache Apex Core
>  Issue Type: Bug
>Reporter: Deepak Narkhede
>Assignee: Deepak Narkhede
>






[GitHub] apex-core pull request #466: APEXCORE-634 Apex Platform unable to set unifie...

2017-02-15 Thread deepak-narkhede
Github user deepak-narkhede closed the pull request at:

https://github.com/apache/apex-core/pull/466




Re: PojoInnerJoin Accumulation emitting Map

2017-02-15 Thread Amol Kekre
yes it should be POJO

Thks
Amol


On Wed, Feb 15, 2017 at 7:34 AM, AJAY GUPTA  wrote:

> Yes, it should be emitting a POJO. This POJO can then be further used to
> join with a third POJO stream, thus behaving like a DB join.
>
> It would be best if we can incorporate this into Schema Discovery design.
>
>
> Ajay
>
> On Wed, 15 Feb 2017 at 5:30 PM, Chinmay Kolhatkar 
> wrote:
>
> Dear Community,
>
> Currently the PojoInnerJoin accumulation accepts 2 POJOs but emits a Map.
>
> I think it should emit a POJO instead of a Map.
>
> Please share your thoughts about this.
>
> Thanks,
> Chinmay.
>


Re: [DISCUSS] Custom Control Tuples Design

2017-02-15 Thread Amol Kekre
This is needed; batch start/end has semantics similar to start/end window
from an operational/functional perspective.

Thks
Amol



On Tue, Feb 14, 2017 at 9:55 PM, Bhupesh Chawda 
wrote:

> +1 for having an immediate delivery mechanism as well.
>
> I would suggest that the other delivery mechanism stays at end of window,
> to be consistent, as I think it may be difficult to determine the last
> arrival of the tuple.
>
> ~ Bhupesh
>
> On Wed, Feb 15, 2017 at 7:04 AM, Pramod Immaneni 
> wrote:
>
> > There have been some recent developments and discussions on the schema
> > side (link below) that warrant a reconsideration of how control tuples
> > get delivered.
> > delivered.
> >
> > http://apache.markmail.org/search/?q=apex+list%3Aorg.apache.apex.dev+schema+discovery+support#query:apex%20list%3Aorg.apache.apex.dev%20schema%20discovery%20support+page:1+mid:oaji26y3xfozap5v+state:results
> >
> > What I would like to suggest is that we allow two delivery options for
> > control tuples, which can be configured on a per-control-tuple basis. The
> > first is to deliver the control tuple to the operator when the first
> > instance of the tuple arrives from any path. The second option is to
> > deliver the control tuple when the last instance of the tuple arrives
> > from all the paths, or at the end window if it is going to be difficult
> > to determine the last arrival. The developer can choose the delivery
> > option for the control tuple, preferably when the tuple is created. The
> > first option will be useful for scenarios like schema propagation or
> > begin-file in batch cases. The second option will be useful for tuples
> > like end-file or end-batch in batch use cases.
> >
> > Thanks
> >
> > On Tue, Jan 10, 2017 at 12:27 PM, Bhupesh Chawda <
> bhup...@datatorrent.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > Based on some discussion, here is what is planned for the propagation
> > > feature for control tuples.
> > >
> > > The signature of the *processControl()* method in
> > > *ControlAwareDefaultInputPort*, which is implemented by the operator
> > > developer, will be as follows:
> > >
> > > *public abstract boolean processControl(UserDefinedControlTuple payload);*
> > >
> > > The boolean returned by the processControl() method indicates (to the
> > > engine) whether or not the operator is able to handle the control tuple
> > > and wants to take care of its propagation.
> > >
> > >- If the method returns true - indicating it is able to handle the
> > >control tuple - the operator has to explicitly emit the control tuples
> > >to the output ports it wishes to propagate to.
> > >
> > >- If the method returns false - indicating it is not able to handle the
> > >control tuple - the control tuple will be propagated by the engine to
> > >all output ports.
> > >
> > > The operator may even emit new control tuples in either case.
> > > Note that for ports that are not control aware, the control tuple is
> > > propagated by default.
> > >
> > > We don't need any output port annotations or operator-level attributes.
> > >
> > > ~ Bhupesh
> > >
> > >
> > > On Mon, Jan 9, 2017 at 5:16 PM, Tushar Gosavi 
> > > wrote:
> > >
> > > > On Sun, Jan 8, 2017 at 11:49 PM, Vlad Rozov  >
> > > > wrote:
> > > > > +1 to managing propagation at the operator level. An operator is
> > > > > either control-tuple aware and needs to manage how control tuples
> > > > > are routed from input ports to output ports, or it is not. In the
> > > > > latter case it does not matter how many input and output ports the
> > > > > operator has, and it is the Apex platform's responsibility to route
> > > > > control tuples. I don't see a use case where an operator that is not
> > > > > aware of a control tuple needs to manage one or more input ports (or
> > > > > similar output ports) differently than others.
> > > > >
> > > >
> > > > The problem with giving the operator explicit control over routing of
> > > > custom tuples is: how does the operator developer know about the
> > > > control tuple requirements of downstream operators in an application?
> > > > For example, in the following DAG
> > > > A -> B -> C
> > > > A - is my custom source operator which emits new control tuple types
> > > > C1 and C.
> > > > B - is an operator from Malhar which handles control tuple C.
> > > > C - is a custom output operator which handles C1.
> > > >
> > > > If B is managing control tuples, then it needs to remember to forward
> > > > unhandled tuples on all output ports, else it will block
> > > > the tuples for downstream operators which might need them, 
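
A minimal sketch of an operator written against the contract proposed in this
thread (ControlAwareDefaultInputPort and UserDefinedControlTuple are the
proposal's own names, not a released API; MyControlTuple is hypothetical):

    import com.datatorrent.api.DefaultOutputPort;

    public class ControlAwareOperatorSketch
    {
      public final transient DefaultOutputPort<Object> output =
          new DefaultOutputPort<>();

      public final transient ControlAwareDefaultInputPort<Object> input =
          new ControlAwareDefaultInputPort<Object>()
      {
        @Override
        public void process(Object tuple)
        {
          output.emit(tuple);  // regular data path
        }

        @Override
        public boolean processControl(UserDefinedControlTuple payload)
        {
          if (payload instanceof MyControlTuple) {
            // Handle the tuple and take over propagation: emit it explicitly
            // on the ports it should be forwarded to.
            output.emit(payload);
            return true;   // engine will not auto-propagate
          }
          return false;    // engine propagates to all output ports
        }
      };
    }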

Re: PojoInnerJoin Accumulation emitting Map

2017-02-15 Thread AJAY GUPTA
Yes, it should be emitting a POJO. This POJO can then be further used to
join with a third POJO stream, thus behaving like a DB join.

It would be best if we can incorporate this into Schema Discovery design.


Ajay

On Wed, 15 Feb 2017 at 5:30 PM, Chinmay Kolhatkar 
wrote:

Dear Community,

Currently the PojoInnerJoin accumulation accepts 2 POJOs but emits a Map.

I think it should emit a POJO instead of a Map.

Please share your thoughts about this.

Thanks,
Chinmay.


Re: APEXMALHAR-2400 In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread AJAY GUPTA
This is a bug if we consider a normal database join. All fields from both
POJOs should be emitted in the result irrespective of the name.

Ajay

On Wed, 15 Feb 2017 at 5:55 PM, Hitesh Kapoor 
wrote:

> Hi All,
>
> In the PojoInnerJoin accumulation, fields with the same name are emitted as
> a single field even if we don't join on them. For example, consider the
> following 2 POJOs on 2 streams:
>
> POJO1
> {
>id: Int
>age : String
> }
>
> POJO2
> {
>  id: Int
>  age : String
>  name : String
> }
>
> If we join only on the field 'id', the resulting stream contains the
> commonly named field (age) only from POJO2. So I am unsure whether the
> resulting stream should contain the field 'age' from only POJO1 (or only
> POJO2), or whether it should contain the field 'age' from both POJOs.
>
> I think this is a bug that should be fixed: the resulting stream should
> contain the commonly named field from both POJOs (and maybe rename it in the
> final output). Let me know your thoughts.
>
> Regards,
> Hitesh
>


APEXMALHAR-2400 In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread Hitesh Kapoor
Hi All,

In the PojoInnerJoin accumulation, fields with the same name are emitted as a
single field even if we don't join on them. For example, consider the
following 2 POJOs on 2 streams:

POJO1
{
   id: Int
   age : String
}

POJO2
{
 id: Int
 age : String
 name : String
}

If we join only on the field 'id', the resulting stream contains the commonly
named field (age) only from POJO2. So I am unsure whether the resulting
stream should contain the field 'age' from only POJO1 (or only POJO2), or
whether it should contain the field 'age' from both POJOs.

I think this is a bug that should be fixed: the resulting stream should
contain the commonly named field from both POJOs (and maybe rename it in the
final output). Let me know your thoughts.

Regards,
Hitesh
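
A minimal sketch of the rename-on-collision behavior suggested above, written
against the accumulation's current internal map representation (names below
are illustrative only):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class FieldMergeSketch
    {
      public static Map<String, Object> merge(Map<String, Object> left,
          Map<String, Object> right, Set<String> joinKeys)
      {
        Map<String, Object> result = new HashMap<>(left);
        for (Map.Entry<String, Object> e : right.entrySet()) {
          if (joinKeys.contains(e.getKey())) {
            continue;  // join keys appear once in the output
          }
          // Rename on collision instead of silently overwriting the value
          // taken from the left POJO.
          String name = result.containsKey(e.getKey())
              ? "right_" + e.getKey() : e.getKey();
          result.put(name, e.getValue());
        }
        return result;
      }
    }

With the POJOs above and joinKeys = {"id"}, the output would contain id, age,
right_age, and name.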


Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases

2017-02-15 Thread Bhupesh Chawda
To better understand the use case for control tuples in batch, I am
creating a prototype for a batch application using File Input and File
Output operators.

To enable basic batch processing for File IO operators, I am proposing the
following changes to File input and output operators:
1. File Input operator emits a watermark each time it opens and closes a
file. These can be "start file" and "end file" watermarks which include the
corresponding file names. The "start file" tuple should be sent before any
of the data from that file flows.
2. File Input operator can be configured to end the application after a
single or n scans of the directory (a batch). This is where the operator
emits the final watermark (the end-of-application control tuple), which will
also shut down the application.
3. The File output operator handles these control tuples. "Start file"
initializes the file name for the incoming tuples. "End file" watermark
forces a finalize on that file.

The user would be able to enable the operators to send only those
watermarks that are needed in the application. If none of the options are
configured, the operators behave as in a streaming application.

There are a few challenges in the implementation where the input operator
is partitioned. In this case, the correlation between the start/end for a
file and the data tuples for that file is lost. Hence we need to maintain
the filename as part of each tuple in the pipeline.

The "start file" and "end file" control tuples in this example are
temporary names for watermarks. We can have generic "start batch" / "end
batch" tuples which could be used for other use cases as well. The Final
watermark is common and serves the same purpose in each case.

Please let me know your thoughts on this.

~ Bhupesh
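
A minimal sketch of what such watermark payloads could look like (the class
and field names are as temporary as the watermark names themselves):

    public class FileWatermark
    {
      public enum Type { START_FILE, END_FILE, FINAL }

      private final Type type;
      private final String fileName;  // null for the final watermark

      public FileWatermark(Type type, String fileName)
      {
        this.type = type;
        this.fileName = fileName;
      }

      public Type getType()
      {
        return type;
      }

      public String getFileName()
      {
        return fileName;
      }
    }

The input operator would emit new FileWatermark(Type.START_FILE, path) before
the first tuple of a file, and the output operator would finalize the file on
the matching END_FILE.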



On Wed, Jan 18, 2017 at 12:22 AM, Bhupesh Chawda 
wrote:

> Yes, this can be part of operator configuration. Given this, for a user,
> defining a batch application would mean configuring the connectors (mostly
> the input operator) in the application for the desired behavior. Similarly,
> use cases other than batch can be achieved.
>
> We may also need to take care of the following:
> 1. Make sure that the watermarks or control tuples are consistent across
> sources. Meaning an HDFS sink should be able to interpret the watermark
> tuple sent out by, say, a JDBC source.
> 2. In addition to I/O connectors, we should also look at the need for
> processing operators to understand some of the control tuples / watermarks.
> For example, we may want to reset the operator behavior on arrival of some
> watermark tuple.
>
> ~ Bhupesh
>
> On Tue, Jan 17, 2017 at 9:59 PM, Thomas Weise  wrote:
>
>> The HDFS source can operate in two modes, bounded or unbounded. If you
>> scan
>> only once, then it should emit the final watermark after it is done.
>> Otherwise it would emit watermarks based on a policy (file names, etc.).
>> The mechanism to generate the marks may depend on the type of source and
>> the user needs to be able to influence/configure it.
>>
>> Thomas
>>
>>
>> On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda 
>> wrote:
>>
>> > Hi Thomas,
>> >
>> > I am not sure I completely understand your suggestion. Are you
>> > suggesting broadening the scope of the proposal to treat all sources as
>> > bounded as well as unbounded?
>> >
>> > In the case of Apex, we treat all sources as unbounded. Even bounded
>> > sources like the HDFS file source are treated as unbounded by means of
>> > scanning the input directory repeatedly.
>> >
>> > Let's consider the HDFS file source, for example:
>> > If we treat it as a bounded source, we can define hooks that allow us to
>> > detect the end of the file and send the "final watermark". We could also
>> > consider the HDFS file source as a streaming source and define hooks that
>> > send watermarks based on different kinds of windows.
>> >
>> > Please correct me if I misunderstand.
>> >
>> > ~ Bhupesh
>> >
>> >
>> > On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise  wrote:
>> >
>> > > Bhupesh,
>> > >
>> > > Please see how that can be solved in a unified way using windows and
>> > > watermarks. It is bounded data vs. unbounded data. In Beam, for
>> > > example, you can use the "global window" and the final watermark to
>> > > accomplish what you are looking for. Batch is just a special case of
>> > > streaming where the source emits the final watermark.
>> > >
>> > > Thanks,
>> > > Thomas
>> > >
>> > >
>> > > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <
>> bhup...@datatorrent.com
>> > >
>> > > wrote:
>> > >
>> > > > Yes, if the user needs to develop a batch application, then
>> > > > batch-aware operators need to be used in the application.
>> > > > The nature of the application is mostly controlled by the input and
>> > > > output operators used in the application.

PojoInnerJoin Accumulation emitting Map

2017-02-15 Thread Chinmay Kolhatkar
Dear Community,

Currently the PojoInnerJoin accumulation accepts 2 POJOs but emits a Map.

I think it should emit a POJO instead of a Map.

Please share your thoughts about this.

Thanks,
Chinmay.
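
A minimal sketch of what "emit a POJO instead of a Map" could mean: populate
a user-supplied output class from the accumulated field map. Plain reflection
is used here for brevity; a real accumulation would more likely use generated
PojoUtils setters for speed.

    import java.lang.reflect.Field;
    import java.util.Map;

    public class MapToPojoSketch
    {
      public static <T> T toPojo(Map<String, Object> fields, Class<T> clazz)
          throws Exception
      {
        T pojo = clazz.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
          Field f = clazz.getDeclaredField(e.getKey());  // one field per key
          f.setAccessible(true);
          f.set(pojo, e.getValue());
        }
        return pojo;
      }
    }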


[jira] [Commented] (APEXMALHAR-2400) In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867618#comment-15867618
 ] 

ASF GitHub Bot commented on APEXMALHAR-2400:


Github user Hitesh-Scorpio closed the pull request at:

https://github.com/apache/apex-malhar/pull/552


> In PojoInnerJoin accumulation same field names are emitted as single field 
> ---
>
> Key: APEXMALHAR-2400
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2400
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Hitesh Kapoor
>Assignee: Hitesh Kapoor
>
> In the PojoInnerJoin accumulation, fields with the same name are emitted as
> a single field. Internally it emits a map, and for a duplicated field name
> the value is overwritten.





[GitHub] apex-malhar pull request #552: APEXMALHAR-2400 same field names were emitted...

2017-02-15 Thread Hitesh-Scorpio
Github user Hitesh-Scorpio closed the pull request at:

https://github.com/apache/apex-malhar/pull/552




[jira] [Commented] (APEXMALHAR-2400) In PojoInnerJoin accumulation same field names are emitted as single field

2017-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867596#comment-15867596
 ] 

ASF GitHub Bot commented on APEXMALHAR-2400:


GitHub user Hitesh-Scorpio opened a pull request:

https://github.com/apache/apex-malhar/pull/552

APEXMALHAR-2400 same field names were emitted as single field

@chinmaykolhatkar please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Hitesh-Scorpio/apex-malhar APEXMALHAR-2400_PojoInnerJoin_Accumulation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/552.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #552


commit 9c9295f9231277f23ef54609c11f0c5cd8e6bb98
Author: Hitesh-Scorpio 
Date:   2017-02-15T10:23:57Z

APEXMALHAR-2400 same field names were emitted as single field




> In PojoInnerJoin accumulation same field names are emitted as single field 
> ---
>
> Key: APEXMALHAR-2400
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2400
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Hitesh Kapoor
>Assignee: Hitesh Kapoor
>
> In the PojoInnerJoin accumulation, fields with the same name are emitted as
> a single field. Internally it emits a map, and for a duplicated field name
> the value is overwritten.





[GitHub] apex-malhar pull request #552: APEXMALHAR-2400 same field names were emitted...

2017-02-15 Thread Hitesh-Scorpio
GitHub user Hitesh-Scorpio opened a pull request:

https://github.com/apache/apex-malhar/pull/552

APEXMALHAR-2400 same field names were emitted as single field

@chinmaykolhatkar please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Hitesh-Scorpio/apex-malhar APEXMALHAR-2400_PojoInnerJoin_Accumulation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/552.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #552


commit 9c9295f9231277f23ef54609c11f0c5cd8e6bb98
Author: Hitesh-Scorpio 
Date:   2017-02-15T10:23:57Z

APEXMALHAR-2400 same field names were emitted as single field






Re: At-least-once semantics for Kafka to Cassandra ingest

2017-02-15 Thread Himanshu Bari
Great. That's what I thought. But I suspect we might be seeing otherwise in
our app. Is there anything that needs to be done with respect to offset
management in the Kafka reader app to ensure this works correctly? E.g., does
the developer need to decide when the Kafka reader app moves the Kafka
topic offset forward to acknowledge that all the messages read to that
point were successfully written into Cassandra?

On Feb 15, 2017 1:32 AM, "Pramod Immaneni"  wrote:

> That is the default behavior Himanshu. The Kafka operator will re-read
> from the older checkpointed offset if there is a failure in the Kafka
> operator. If there is a failure in any of the downstream operators
> including Cassandra, they will read the previous messages from the
> buffer present in the upstream operator, so in effect you won't lose
> messages.
>
> Thanks
>
> > On Feb 15, 2017, at 10:34 AM, Himanshu Bari 
> wrote:
> >
> > Hey guys
> >
> > We are looking for at-least-once semantics and not exactly-once. Since
> > the sink is Cassandra, it is OK if the same record is written twice; it
> > will just overwrite on any reprocessing...
> >
> > Himanshu
> >
> >> On Feb 14, 2017 9:01 PM, "Priyanka Gugale" 
> wrote:
> >>
> >> For this particular case, Kafka -> Cassandra, you need not worry about
> >> partial windows. The Cassandra output operator does batch processing,
> >> i.e., all records received in a window will be written at end window. So
> >> IMO, if you set exactly-once processing on the Kafka input operator and
> >> choose the transactional Cassandra output operator, you will achieve
> >> exactly-once processing. If you have other operators in your DAG, you
> >> might want to make sure they are idempotent (please check the blog
> >> shared by Sandesh for reference).
> >>
> >> -Priyanka
> >>
> >> On Wed, Feb 15, 2017 at 4:06 AM, Sandesh Hegde  >
> >> wrote:
> >>
> >>> The setting mentioned by Sanjay will only guarantee exactly-once for
> >>> complete windows, but not for a partial window processed by the
> >>> operator; in a way, that setting is a misnomer.
> >>> To achieve exactly-once, there are some preconditions that need to be
> >>> met along with the support in the output operator. Here is a blog that
> >>> gives the idea about exactly-once:
> >>> https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
> >>>
> >>> On Tue, Feb 14, 2017 at 2:11 PM Sanjay Pujare 
> >>> wrote:
> >>>
>  Have you taken a look at
>  http://apex.apache.org/docs/apex/application_development/#exactly-once ?
>  I.e., setting that processing mode on all the operators in the pipeline.
> 
> 
> 
>  On 2/14/17, 12:00 PM, "Himanshu Bari"  wrote:
> 
> How can we ensure that the Kafka to Cassandra ingestion pipeline in Apex
> will guarantee exactly-once processing semantics?
> E.g., a message was read from Kafka but the Apex app died before it was
> written successfully to Cassandra.
> 
> 
> 
>
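
A minimal sketch of the settings discussed in this thread, assuming the
standard Apex DAG attribute API (the operator variable is illustrative):

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.Operator;

    public class ProcessingModeSketch
    {
      public static void configure(DAG dag, Operator kafkaInput)
      {
        // Per-operator processing mode from the Apex docs linked above. As
        // noted in the thread, this alone is not end-to-end exactly-once:
        // an idempotent or transactional sink (e.g., the transactional
        // Cassandra output operator) is also required.
        dag.setAttribute(kafkaInput, OperatorContext.PROCESSING_MODE,
            Operator.ProcessingMode.EXACTLY_ONCE);
      }
    }

For the at-least-once requirement in this thread, the default mode already
suffices, since the Kafka operator replays from the checkpointed offset and
Cassandra overwrites on reprocessing.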


[GitHub] apex-malhar pull request #548: APEXMALHAR-2399 default constructor was direc...

2017-02-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/548




[jira] [Commented] (APEXMALHAR-2399) In PojoInnerJoin accumulation default constructor is directly throwing an exception which messes up in default serialization.

2017-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867535#comment-15867535
 ] 

ASF GitHub Bot commented on APEXMALHAR-2399:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/548


> In PojoInnerJoin accumulation default constructor is directly throwing an 
> exception which messes up in default serialization.
> -
>
> Key: APEXMALHAR-2399
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2399
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Hitesh Kapoor
>Assignee: Hitesh Kapoor
> Fix For: 3.7.0
>
>
> In the PojoInnerJoin accumulation, the default constructor directly throws
> an exception, which breaks default serialization. The exception is thrown
> telling the user to specify the number of streams. Currently we only merge
> 2 streams, so there is no need to throw this exception; it also breaks the
> serialization, and the application can't be loaded.





[jira] [Resolved] (APEXMALHAR-2399) In PojoInnerJoin accumulation default constructor is directly throwing an exception which messes up in default serialization.

2017-02-15 Thread Chinmay Kolhatkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinmay Kolhatkar resolved APEXMALHAR-2399.
---
   Resolution: Fixed
Fix Version/s: 3.7.0

> In PojoInnerJoin accumulation default constructor is directly throwing an 
> exception which messes up in default serialization.
> -
>
> Key: APEXMALHAR-2399
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2399
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Hitesh Kapoor
>Assignee: Hitesh Kapoor
> Fix For: 3.7.0
>
>
> In the PojoInnerJoin accumulation, the default constructor directly throws
> an exception, which breaks default serialization. The exception is thrown
> telling the user to specify the number of streams. Currently we only merge
> 2 streams, so there is no need to throw this exception; it also breaks the
> serialization, and the application can't be loaded.





Re: At-least-once semantics for Kafka to Cassandra ingest

2017-02-15 Thread Pramod Immaneni
That is the default behavior Himanshu. The Kafka operator will re-read
from the older checkpointed offset if there is a failure in the Kafka
operator. If there is a failure in any of the downstream operators
including Cassandra, they will read the previous messages from the
buffer present in the upstream operator, so in effect you won't lose
messages.

Thanks

> On Feb 15, 2017, at 10:34 AM, Himanshu Bari  wrote:
>
> Hey guys
>
> We are looking for at-least-once semantics and not exactly-once. Since the
> sink is Cassandra, it is OK if the same record is written twice; it will
> just overwrite on any reprocessing...
>
> Himanshu
>
>> On Feb 14, 2017 9:01 PM, "Priyanka Gugale"  wrote:
>>
>> For this particular case, Kafka -> Cassandra, you need not worry about
>> partial windows. The Cassandra output operator does batch processing,
>> i.e., all records received in a window will be written at end window. So
>> IMO, if you set exactly-once processing on the Kafka input operator and
>> choose the transactional Cassandra output operator, you will achieve
>> exactly-once processing. If you have other operators in your DAG, you
>> might want to make sure they are idempotent (please check the blog shared
>> by Sandesh for reference).
>>
>> -Priyanka
>>
>> On Wed, Feb 15, 2017 at 4:06 AM, Sandesh Hegde 
>> wrote:
>>
>>> The setting mentioned by Sanjay will only guarantee exactly-once for
>>> complete windows, but not for a partial window processed by the operator;
>>> in a way, that setting is a misnomer.
>>> To achieve exactly-once, there are some preconditions that need to be met
>>> along with the support in the output operator. Here is a blog that gives
>>> the idea about exactly-once:
>>> https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
>>>
>>> On Tue, Feb 14, 2017 at 2:11 PM Sanjay Pujare 
>>> wrote:
>>>
 Have you taken a look at
 http://apex.apache.org/docs/apex/application_development/#exactly-once ?
 I.e., setting that processing mode on all the operators in the pipeline.



 On 2/14/17, 12:00 PM, "Himanshu Bari"  wrote:

How can we ensure that the Kafka to Cassandra ingestion pipeline in Apex
will guarantee exactly-once processing semantics?
E.g., a message was read from Kafka but the Apex app died before it was
written successfully to Cassandra.


