[jira] [Commented] (APEXCORE-505) setup and activate calls in operator block heartbeat loop in container

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547462#comment-15547462
 ] 

ASF GitHub Bot commented on APEXCORE-505:
-

Github user sandeshh closed the pull request at:

https://github.com/apache/apex-core/pull/405


> setup and activate calls in operator block heartbeat loop in container
> --
>
> Key: APEXCORE-505
> URL: https://issues.apache.org/jira/browse/APEXCORE-505
> Project: Apache Apex Core
>  Issue Type: Bug
>Reporter: Ashwin Chandra Putta
>Assignee: Sandesh
>
> The setup and activate calls in operators block the StreamingContainer 
> heartbeat loop that sends heartbeats to the application master. As a 
> result, if activation/setup takes longer than the heartbeat timeout for any 
> operator in a container, the app master ends up killing the container for a 
> heartbeat timeout even though the container is active.
> To test this, I created a simple test application and added a 40-second 
> sleep in the setup or activate calls. The application master shows the 
> heartbeat timeout message and kills the container with the operator. The 
> stack trace taken on the container while it was active follows:
> sleep in activate callback
> ===
> {code}
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode):
> "Attach Listener" daemon prio=10 tid=0x7f8060ac nid=0x6d2f waiting on 
> condition [0x]
>java.lang.Thread.State: RUNNABLE
> "1/randomGenerator:RandomNumberGenerator" prio=10 tid=0x7f8060b9b800 
> nid=0x65a2 waiting on condition [0x7f8050bc8000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at com.datatorrent.sample.MyOperator.activate(MyOperator.java:63)
>   at com.datatorrent.stram.engine.Node.activate(Node.java:619)
>   at 
> com.datatorrent.stram.engine.GenericNode.activate(GenericNode.java:205)
>   at 
> com.datatorrent.stram.engine.StreamingContainer.setupNode(StreamingContainer.java:1336)
>   at 
> com.datatorrent.stram.engine.StreamingContainer.access$100(StreamingContainer.java:130)
>   at 
> com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1396)
> "ProcessWideEventLoop" prio=10 tid=0x7f8060afb800 nid=0x658b runnable 
> [0x7f8050cc9000]
>java.lang.Thread.State: RUNNABLE
>   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>   - locked <0x0007d186a380> (a 
> com.datatorrent.netlet.OptimizedEventLoop$SelectedSelectionKeySet)
>   - locked <0x0007d185a4a0> (a java.util.Collections$UnmodifiableSet)
>   - locked <0x0007d185a070> (a sun.nio.ch.EPollSelectorImpl)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>   at 
> com.datatorrent.netlet.OptimizedEventLoop.runEventLoop(OptimizedEventLoop.java:185)
>   at 
> com.datatorrent.netlet.OptimizedEventLoop.runEventLoop(OptimizedEventLoop.java:157)
>   at 
> com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:156)
>   at java.lang.Thread.run(Thread.java:745)
> "Dispatcher-0" daemon prio=10 tid=0x7f8060ae8000 nid=0x6586 waiting on 
> condition [0x7f8050dca000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007dc5fa118> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at 
> net.engio.mbassy.bus.AbstractSyncAsyncMessageBus$1.run(AbstractSyncAsyncMessageBus.java:51)
>   at java.lang.Thread.run(Thread.java:745)
> "org.apache.hadoop.hdfs.PeerCache@389ed39" daemon prio=10 
> tid=0x7f8060abf800 nid=0x6577 waiting on condition [0x7f8050fcc000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:255)
>   at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
>   at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
>   at java.lang.Thread.run(Thread.java:745)
> "IPC Parameter Sending Thread #0" daemon prio=10 tid=0x7f8060a2f800 
> nid=0x6572 waiting on condition 
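
To make the failure mode concrete, here is a minimal, hypothetical Java sketch (not Apex code; all names are illustrative) of the shape of the eventual fix: if the slow activate() runs on its own thread rather than on the heartbeat thread, heartbeats keep flowing during activation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatSketch {
    // Stands in for a slow operator setup()/activate(); the ticket used a
    // 40-second sleep, scaled down here to 200 ms.
    static void slowActivate() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Activation runs off the heartbeat thread; returns the number of
    // heartbeats sent while activation was still in progress.
    public static int run() {
        AtomicInteger heartbeats = new AtomicInteger();
        CountDownLatch activated = new CountDownLatch(1);
        Thread activator = new Thread(() -> {
            slowActivate();
            activated.countDown();
        });
        activator.start();
        try {
            while (activated.getCount() > 0) {
                heartbeats.incrementAndGet(); // stands in for one heartbeat
                Thread.sleep(20);
            }
            activator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return heartbeats.get();
    }
}
```

Had slowActivate() run on the heartbeat thread itself, the count would stay at zero for the full activation, which is exactly the timeout scenario described above.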

[jira] [Commented] (APEXCORE-505) setup and activate calls in operator block heartbeat loop in container

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547463#comment-15547463
 ] 

ASF GitHub Bot commented on APEXCORE-505:
-

GitHub user sandeshh reopened a pull request:

https://github.com/apache/apex-core/pull/405

APEXCORE-505 Heartbeat loop was blocked waiting for operator activati…

…on. The reason for this is that stream activation (only 
BufferServerSubscriber and WindowGenerator) waits for operator activation in 
the heartbeat thread. There is no need for this synchronization, as tuples 
are pulled from the queues by the operators.
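
The key observation — tuples are pulled from the queues by the operators — means a stream can enqueue data before its downstream operator is activated without losing anything. A small illustrative sketch in plain Java (not Apex code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    // The "stream" side enqueues tuples before any consumer is active;
    // the "operator" side, activated later, drains whatever is queued.
    public static int produceThenConsume() {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        for (int i = 0; i < 5; i++) {
            queue.offer(i); // stream writes without waiting for the operator
        }
        int consumed = 0;
        while (queue.poll() != null) { // operator pulls after activation
            consumed++;
        }
        return consumed;
    }
}
```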

@vrozov @tweise please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sandeshh/apex-core APEXCORE-505

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/405.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #405


commit 596b996b1aa694bb9a0924f9094ef688df23678e
Author: Sandesh Hegde 
Date:   2016-10-05T01:29:57Z

APEXCORE-505 Heartbeat loop was blocked waiting for operator activation. 
The reason for this is that stream activation (only BufferServerSubscriber 
and WindowGenerator) waits for operator activation in the heartbeat thread. 
After analysis and sanity testing, we see no need for synchronization 
between operator and stream activation.






[jira] [Commented] (APEXCORE-505) setup and activate calls in operator block heartbeat loop in container

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547380#comment-15547380
 ] 

ASF GitHub Bot commented on APEXCORE-505:
-

GitHub user sandeshh opened a pull request:

https://github.com/apache/apex-core/pull/405





[jira] [Commented] (APEXCORE-536) Upgrade Hadoop dependency

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547333#comment-15547333
 ] 

ASF GitHub Bot commented on APEXCORE-536:
-

GitHub user davidyan74 reopened a pull request:

https://github.com/apache/apex-core/pull/404

APEXCORE-536 #resolve Upgrade Hadoop dependency version from 2.2.0 to 2.6.0



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davidyan74/apex-core APEXCORE-536

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #404


commit 6a85acb937aa425c7ca7c8feb3a5587e78d228ac
Author: David Yan 
Date:   2016-10-04T01:03:59Z

APEXCORE-536 #resolve Upgrade Hadoop dependency version from 2.2.0 to 2.6.0




> Upgrade Hadoop dependency
> -
>
> Key: APEXCORE-536
> URL: https://issues.apache.org/jira/browse/APEXCORE-536
> Project: Apache Apex Core
>  Issue Type: Improvement
>Reporter: Thomas Weise
>Assignee: David Yan
>  Labels: roadmap
>
> Currently Apex depends on Hadoop 2.2 and runs on all later 2.x versions. 
> Hadoop 2.2 is quite old; most Apex users have more recent Hadoop installs, 
> and the latest distro releases are based on 2.6 and 2.7. Several important 
> features added to Hadoop since 2.2 could be leveraged by Apex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXCORE-536) Upgrade Hadoop dependency

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547313#comment-15547313
 ] 

ASF GitHub Bot commented on APEXCORE-536:
-

Github user davidyan74 closed the pull request at:

https://github.com/apache/apex-core/pull/404





[jira] [Assigned] (APEXMALHAR-2269) AbstractFileInputOperator: During replay, IO errors not handled

2016-10-04 Thread Matt Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Zhang reassigned APEXMALHAR-2269:
--

Assignee: Matt Zhang

> AbstractFileInputOperator: During replay, IO errors not handled
> ---
>
> Key: APEXMALHAR-2269
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2269
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Munagala V. Ramanath
>Assignee: Matt Zhang
>
> In AbstractFileInputOperator, during replay(), any IOExceptions that occur 
> are not handled gracefully -- the code simply throws a RuntimeException. 
> Code similar to the behavior of emitTuples(), where the failureHandling() 
> method is called, needs to be added.
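
A hedged sketch of the fix the ticket asks for — route read errors through a failure-handling path, as emitTuples() does, instead of throwing. The readEntry() helper and the recovery logic are hypothetical, not the actual Malhar API:

```java
import java.io.IOException;

public class ReplaySketch {
    // Stands in for reading one recovery entry; may fail with an IOException.
    static void readEntry(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated read error");
        }
    }

    // Replays entries, handling IO errors gracefully instead of wrapping
    // them in a RuntimeException; returns how many entries failed.
    public static int replay(boolean[] entries) {
        int failures = 0;
        for (boolean fail : entries) {
            try {
                readEntry(fail);
            } catch (IOException e) {
                failures++; // e.g. close the stream and requeue the file
            }
        }
        return failures;
    }
}
```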





[jira] [Assigned] (APEXMALHAR-2270) AbstractFileInputOperator: During replay, inputStream should skip tuples

2016-10-04 Thread Matt Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Zhang reassigned APEXMALHAR-2270:
--

Assignee: Matt Zhang

> AbstractFileInputOperator: During replay, inputStream should skip tuples
> 
>
> Key: APEXMALHAR-2270
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2270
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Munagala V. Ramanath
>Assignee: Matt Zhang
>
> In AbstractFileInputOperator.replay(), when a RecoveryEntry with a positive 
> offset is processed, retryFailedFile() is called to open the file, but there 
> is no code to skip tuples until the desired offset is reached, as is done in 
> emitTuples().
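
The missing skip step could look roughly like this — a hypothetical sketch, not the Malhar implementation: after reopening the file, records already emitted before the failure are skipped until the recovery offset is reached.

```java
import java.util.Iterator;
import java.util.List;

public class SkipSketch {
    // Returns the first record at or beyond the given offset, skipping
    // records that were already emitted before the failure; null when the
    // file has fewer records than the offset.
    public static String firstAfterOffset(List<String> records, int offset) {
        Iterator<String> it = records.iterator();
        for (int i = 0; i < offset && it.hasNext(); i++) {
            it.next(); // skip an already-emitted tuple
        }
        return it.hasNext() ? it.next() : null;
    }
}
```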





[jira] [Assigned] (APEXMALHAR-2263) Offsets in AbstractFileInputOperator should be long rather than int

2016-10-04 Thread Matt Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Zhang reassigned APEXMALHAR-2263:
--

Assignee: Matt Zhang

> Offsets in AbstractFileInputOperator should be long rather than int
> ---
>
> Key: APEXMALHAR-2263
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2263
> Project: Apache Apex Malhar
>  Issue Type: Bug
>  Components: adapters other
>Reporter: Munagala V. Ramanath
>Assignee: Matt Zhang
>
> Offsets in AbstractFileInputOperator use the int type, which means files 
> with more than 2^31 - 1 records will cause overflows and mysterious 
> failures. They should be changed to use long.
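
The overflow itself is easy to demonstrate: an int offset wraps negative one record past Integer.MAX_VALUE, while a long does not.

```java
public class OffsetSketch {
    // Incrementing an int offset past 2^31 - 1 wraps to a negative value.
    public static boolean intOverflows() {
        int offset = Integer.MAX_VALUE;
        offset++;
        return offset < 0;
    }

    // The same increment on a long stays positive (2^31 fits easily).
    public static boolean longIsSafe() {
        long offset = Integer.MAX_VALUE;
        offset++;
        return offset > 0;
    }
}
```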





[jira] [Assigned] (APEXCORE-536) Upgrade Hadoop dependency

2016-10-04 Thread David Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXCORE-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Yan reassigned APEXCORE-536:
--

Assignee: David Yan



[jira] [Commented] (APEXCORE-536) Upgrade Hadoop dependency

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547171#comment-15547171
 ] 

ASF GitHub Bot commented on APEXCORE-536:
-

GitHub user davidyan74 reopened a pull request:

https://github.com/apache/apex-core/pull/404

APEXCORE-536 #resolve Upgrade Hadoop dependency version from 2.2.0 to 2.6.0



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davidyan74/apex-core APEXCORE-536

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #404


commit da59054fb737bfd9ab0b123fd711a0b273d765e2
Author: David Yan 
Date:   2016-10-04T01:03:59Z

APEXCORE-536 #resolve Upgrade Hadoop dependency version from 2.2.0 to 2.6.0






[jira] [Commented] (APEXCORE-536) Upgrade Hadoop dependency

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547170#comment-15547170
 ] 

ASF GitHub Bot commented on APEXCORE-536:
-

Github user davidyan74 closed the pull request at:

https://github.com/apache/apex-core/pull/404






[jira] [Commented] (APEXCORE-536) Upgrade Hadoop dependency

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547010#comment-15547010
 ] 

ASF GitHub Bot commented on APEXCORE-536:
-

GitHub user davidyan74 opened a pull request:

https://github.com/apache/apex-core/pull/404

APEXCORE-536 #resolve Upgrade Hadoop dependency version from 2.2.0 to 2.6.0



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davidyan74/apex-core APEXCORE-536

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-core/pull/404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #404









[jira] [Updated] (APEXCORE-547) Strict & lenient Physical Plan checking

2016-10-04 Thread Sandesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXCORE-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandesh updated APEXCORE-547:
-
Description: 
This will need a bigger discussion, filing here so that it can be tracked.

Example:

Container max memory is set by YARN settings, so multiple operators in a 
container with high memory requirements may not get sufficient memory, as Apex 
simply allocates the YARN-allowed max memory and launches the application.

Users should have a control to not launch the app if the memory requirements 
cannot be met.
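The strict/lenient distinction proposed above can be sketched as follows. This is a hypothetical illustration in Python (the names `check_physical_plan`, `requested_mb`, and `yarn_max_mb` are invented for the example, not part of Apex):

```python
# Hypothetical sketch of the proposed strict/lenient check (names are
# invented for illustration): in strict mode, refuse to launch when the
# plan's requested container memory exceeds the YARN maximum; in lenient
# mode, cap at the YARN maximum (the current behavior).
def check_physical_plan(requested_mb, yarn_max_mb, strict=False):
    if requested_mb <= yarn_max_mb:
        return requested_mb            # request fits: allocate as asked
    if strict:
        raise ValueError(
            f"plan needs {requested_mb} MB but YARN allows only {yarn_max_mb} MB")
    return yarn_max_mb                 # lenient: cap at the YARN max

print(check_physical_plan(4096, 8192))    # 4096
print(check_physical_plan(16384, 8192))   # 8192 (lenient cap)
```

In strict mode the same oversized plan would raise instead of silently launching with less memory than the operators need.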





  was:
This will need a bigger discussion, filing here so that it can be tracked.

Example:

Container max memory is set by yarn settings. So multiple operators in a 
container with the high memory requirement may not get the sufficient memory, 
as Apex just allocates the yarn allowed max memory and launches the 
application. 

Users should have a control saying don't launch the app if memory requirements 
are not set.






> Strict & lenient Physical Plan checking
> ---
>
> Key: APEXCORE-547
> URL: https://issues.apache.org/jira/browse/APEXCORE-547
> Project: Apache Apex Core
>  Issue Type: Improvement
>Reporter: Sandesh
>
> This will need a bigger discussion, filing here so that it can be tracked.
> Example:
> Container max memory is set by yarn settings. So multiple operators in a 
> container with the high memory requirement may not get the sufficient memory, 
> as Apex just allocates the yarn allowed max memory and launches the 
> application. 
> Users should have a control saying don't launch the app if memory 
> requirements are not met.





[jira] [Created] (APEXCORE-547) Strict & lenient Physical Plan checking

2016-10-04 Thread Sandesh (JIRA)
Sandesh created APEXCORE-547:


 Summary: Strict & lenient Physical Plan checking
 Key: APEXCORE-547
 URL: https://issues.apache.org/jira/browse/APEXCORE-547
 Project: Apache Apex Core
  Issue Type: Improvement
Reporter: Sandesh


This will need a bigger discussion, filing here so that it can be tracked.

Example:

Container max memory is set by yarn settings. So multiple operators in a 
container with the high memory requirement may not get the sufficient memory, 
as Apex just allocates the yarn allowed max memory and launches the 
application. 

Users should have a control saying don't launch the app if memory requirements 
are not set.









[jira] [Commented] (APEXMALHAR-2276) ManagedState: value of a key does not get over-written in the same time bucket

2016-10-04 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546623#comment-15546623
 ] 

Chandni Singh commented on APEXMALHAR-2276:
---

Yes, time gets mapped to a time bucket; however, the exact value of the time is 
lost, and currently we cannot derive the time from the time bucket.

This affects cases where we want to save the most recent value of a key. In 
use cases like de-duplication and aggregation, this is not a problem.

In de-duplication, we check whether the key is present and drop the event if it 
is a duplicate.
In aggregation, we take the existing value in the time bucket and aggregate it 
with the new value.
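The two use cases described above can be sketched as follows (a Python illustration of the assumed semantics, not Malhar code): de-duplication only needs key presence in the bucket, and aggregation folds the new value into the existing one, so neither needs the exact event time back.

```python
# Sketch (assumed semantics, not Malhar code): de-duplication only needs
# key presence in the bucket; aggregation folds the new value into the
# existing one -- neither needs the exact event time.
def deduplicate(bucket, events):
    out = []
    for key, value in events:
        if key not in bucket:      # presence check is enough
            bucket[key] = value
            out.append((key, value))
    return out

def aggregate(bucket, events):
    for key, value in events:
        bucket[key] = bucket.get(key, 0) + value  # fold into existing value
    return bucket

print(deduplicate({}, [("a", 1), ("a", 2)]))   # [('a', 1)]
print(aggregate({}, [("a", 1), ("a", 2)]))     # {'a': 3}
```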


> ManagedState: value of a key does not get over-written in the same time bucket
> --
>
> Key: APEXMALHAR-2276
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2276
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Siyuan Hua
>Assignee: Chandni Singh
> Fix For: 3.6.0
>
>
> For example:
> ManagedTimeUnifiedStateImpl mtus;
> mtus.put(1, key1, val1)
> mtus.put(1, key1, val2)
> mtus.get(1, key1).equals(val2) will return false
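One way to picture the reported behavior is the following minimal model (Python for illustration only; this is not the actual ManagedTimeUnifiedStateImpl API, and the append-only-log assumption is ours): if writes within a time bucket are appended and a read returns the first entry found for a key, a second put in the same bucket is never observed.

```python
# Illustrative model (not the real ManagedTimeUnifiedStateImpl): each
# time bucket is an append-only log, and get() returns the first entry
# found for a key -- so a second put in the same bucket is not seen.
from collections import defaultdict

class TimeBucketedState:
    def __init__(self, bucket_span=60):
        self.bucket_span = bucket_span
        self.buckets = defaultdict(list)  # bucket id -> [(key, value), ...]

    def put(self, time, key, value):
        self.buckets[time // self.bucket_span].append((key, value))

    def get(self, time, key):
        for k, v in self.buckets[time // self.bucket_span]:
            if k == key:
                return v  # first write wins: later puts are shadowed
        return None

mtus = TimeBucketedState()
mtus.put(1, "key1", "val1")
mtus.put(1, "key1", "val2")
print(mtus.get(1, "key1"))  # val1, not val2 -- mirrors the reported bug
```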





[jira] [Issue Comment Deleted] (APEXMALHAR-2276) ManagedState: value of a key does not get over-written in the same time bucket

2016-10-04 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated APEXMALHAR-2276:
--
Comment: was deleted

(was: Yes time gets mapped to a time bucket, however the exact value of time is 
lost and currently we cannot derive time from the timebucket. 

This affects cases where we want to save the most recent value of a key. In 
use-cases, like de-duplication/aggregation this is not the case.

In De-duplication we check whether there is a key present and drop the event if 
its duplicate. 
In Aggregation, we use the existing value in the time-bucket and aggregate it 
with new value.
)

> ManagedState: value of a key does not get over-written in the same time bucket
> --
>
> Key: APEXMALHAR-2276
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2276
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Siyuan Hua
>Assignee: Chandni Singh
> Fix For: 3.6.0
>
>
> For example:
> ManagedTimeUnifiedStateImpl mtus;
> mtus.put(1, key1, val1)
> mtus.put(1, key1, val2)
> mtus.get(1, key1).equals(val2) will return false





[jira] [Commented] (APEXMALHAR-2276) ManagedState: value of a key does not get over-written in the same time bucket

2016-10-04 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546592#comment-15546592
 ] 

Chandni Singh commented on APEXMALHAR-2276:
---

As [~chaithu] pointed out, we need to compare times and not time buckets. 
The fix for that will not be simple: for each key/value we now have to remember 
the latest time. So far the bucket has had no reference to time, which means the 
bucket API will probably change as well.
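The fix direction described above can be sketched like this (an assumed design in Python, not the actual bucket API): keep the event time next to each value and compare times, not bucket ids, when deciding whether to overwrite.

```python
# Sketch of the fix direction (assumed, not the actual bucket API):
# store (time, value) per key and let the newer write win, comparing
# event times rather than bucket ids.
class LatestValueBucket:
    def __init__(self):
        self.entries = {}  # key -> (time, value)

    def put(self, time, key, value):
        old = self.entries.get(key)
        if old is None or time >= old[0]:  # newer (or same-time later) write wins
            self.entries[key] = (time, value)

    def get(self, key):
        entry = self.entries.get(key)
        return entry[1] if entry else None

b = LatestValueBucket()
b.put(1, "key1", "val1")
b.put(1, "key1", "val2")   # same time bucket, later write
print(b.get("key1"))       # val2
```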


> ManagedState: value of a key does not get over-written in the same time bucket
> --
>
> Key: APEXMALHAR-2276
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2276
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Siyuan Hua
>Assignee: Chandni Singh
> Fix For: 3.6.0
>
>
> For example:
> ManagedTimeUnifiedStateImpl mtus;
> mtus.put(1, key1, val1)
> mtus.put(1, key1, val2)
> mtus.get(1, key1).equals(val2) will return false





[jira] [Commented] (APEXMALHAR-2276) ManagedState: value of a key does not get over-written in the same time bucket

2016-10-04 Thread Siyuan Hua (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546598#comment-15546598
 ] 

Siyuan Hua commented on APEXMALHAR-2276:


Time will be mapped to a time bucket anyways, correct?

> ManagedState: value of a key does not get over-written in the same time bucket
> --
>
> Key: APEXMALHAR-2276
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2276
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Siyuan Hua
>Assignee: Chandni Singh
> Fix For: 3.6.0
>
>
> For example:
> ManagedTimeUnifiedStateImpl mtus;
> mtus.put(1, key1, val1)
> mtus.put(1, key1, val2)
> mtus.get(1, key1).equals(val2) will return false





[jira] [Commented] (APEXMALHAR-2254) File input operator is not idempotent with closing files on replay

2016-10-04 Thread Munagala V. Ramanath (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546572#comment-15546572
 ] 

Munagala V. Ramanath commented on APEXMALHAR-2254:
--

Here is a list of other JIRAs related to this operator:
APEXMALHAR-2250 AbstractFileInputOperator.DirectoryScanner does not handle 
directories correctly.
APEXMALHAR-2270 AbstractFileInputOperator: During replay, inputStream should 
skip tuples
APEXMALHAR-2269 AbstractFileInputOperator: During replay, IO errors not handled
APEXMALHAR-2263 Offsets in AbstractFileInputOperator should be long rather than 
int
APEXMALHAR-2021 Add property to AbstractFileInputOperator to trim 
processedFiles and ignoredFiles
APEXMALHAR-2268 AbstractFileInputOperator: During replay, readEntity may be 
called without calling openFile.
APEXMALHAR-2274 AbstractFileInputOperator gets killed when there are a large 
number of files.


> File input operator is not idempotent with closing files on replay
> --
>
> Key: APEXMALHAR-2254
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2254
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>
> With the file input operator, on a replay in a failure scenario, the same 
> data is output as before the failure, for every window that is being replayed 
> after checkpoint. To do this the operator keeps track of the files and 
> offsets for every window and replays the data based on that. 
> However, if it so happens that before the failure the processing of a file 
> was finished and it was closed exactly before the end window and the next 
> file was opened and processed in a new window, in the replay the closing of 
> the first file does not happen in earlier window but happens in the latter 
> window. This can cause problems if an operator depends on the closing file 
> also to happen in an idempotent manner.
> Improve the operator to save the closing and opening of files in the 
> idempotent state as well so that it can also happen in an idempotent manner.
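The improvement described above can be sketched as follows (an assumed design in Python, not the actual operator code): record file open/close events per window alongside the offsets, so that a replay re-issues closeFile() in the same window it originally happened in.

```python
# Sketch (assumed design, not the actual AbstractFileInputOperator):
# record open/close events per window along with offsets, so replay
# re-issues a close in the same window it originally happened in.
class ReplayLog:
    def __init__(self):
        self.per_window = {}  # window id -> list of (event, filename)

    def record(self, window, event, filename):
        self.per_window.setdefault(window, []).append((event, filename))

    def replay(self, window):
        return self.per_window.get(window, [])

log = ReplayLog()
log.record(10, "close", "a.txt")   # file finished just before end of window 10
log.record(11, "open", "b.txt")    # next file opened in window 11
# On recovery, window 10 replays the close exactly where it happened:
print(log.replay(10))  # [('close', 'a.txt')]
```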





[jira] [Comment Edited] (APEXMALHAR-2254) File input operator is not idempotent with closing files on replay

2016-10-04 Thread Munagala V. Ramanath (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546182#comment-15546182
 ] 

Munagala V. Ramanath edited comment on APEXMALHAR-2254 at 10/4/16 6:45 PM:
---

It would be useful to define what we mean for this operator 
(AbstractFileInputOperator) to be idempotent: It doesn't quite work to define 
it as "same tuples output in the same windows, given same input" because this 
operator is abstract, has no ports and emits no tuples.

Concrete subclasses are expected to implement the emit() method to actually 
emit tuples. So a definition that involves calls to emit(), openFile(), 
closeFile() may be more meaningful. Also, clarifying what is expected of 
subclasses in order to maintain idempotency will also be useful.

Finally, there are several other JIRAs related to this class and it may be 
useful to keep them in mind too.


was (Author: dtram):
It would be useful to define what we mean for this operator 
(AbstractFileInputOperator) to be idempotent: It doesn't quite to define it as 
"same tuples output in the same windows, given same input" because this 
operator is abstract, has no ports and emits no tuples.

Concrete subclasses are expected to implement the emit() method to actually 
emit tuples. So a definition that involves calls to emit(), openFile(), 
closeFile() may be more meaningful. Also, clarifying what is expected of 
subclasses in order to maintain idempotency will also be useful.

Finally, there are several other JIRAs related to this class and it may be 
useful to keep them in mind too.

> File input operator is not idempotent with closing files on replay
> --
>
> Key: APEXMALHAR-2254
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2254
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>
> With the file input operator, on a replay in a failure scenario, the same 
> data is output as before the failure, for every window that is being replayed 
> after checkpoint. To do this the operator keeps track of the files and 
> offsets for every window and replays the data based on that. 
> However, if it so happens that before the failure the processing of a file 
> was finished and it was closed exactly before the end window and the next 
> file was opened and processed in a new window, in the replay the closing of 
> the first file does not happen in earlier window but happens in the latter 
> window. This can cause problems if an operator depends on the closing file 
> also to happen in an idempotent manner.
> Improve the operator to save the closing and opening of files in the 
> idempotent state as well so that it can also happen in an idempotent manner.





[jira] [Commented] (APEXMALHAR-2254) File input operator is not idempotent with closing files on replay

2016-10-04 Thread Munagala V. Ramanath (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546182#comment-15546182
 ] 

Munagala V. Ramanath commented on APEXMALHAR-2254:
--

It would be useful to define what we mean for this operator 
(AbstractFileInputOperator) to be idempotent: it doesn't quite work to define it 
as "same tuples output in the same windows, given the same input", because this 
operator is abstract, has no ports, and emits no tuples.

Concrete subclasses are expected to implement the emit() method to actually 
emit tuples. So a definition that involves calls to emit(), openFile(), 
closeFile() may be more meaningful. Also, clarifying what is expected of 
subclasses in order to maintain idempotency will also be useful.

Finally, there are several other JIRAs related to this class and it may be 
useful to keep them in mind too.

> File input operator is not idempotent with closing files on replay
> --
>
> Key: APEXMALHAR-2254
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2254
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>
> With the file input operator, on a replay in a failure scenario, the same 
> data is output as before the failure, for every window that is being replayed 
> after checkpoint. To do this the operator keeps track of the files and 
> offsets for every window and replays the data based on that. 
> However, if it so happens that before the failure the processing of a file 
> was finished and it was closed exactly before the end window and the next 
> file was opened and processed in a new window, in the replay the closing of 
> the first file does not happen in earlier window but happens in the latter 
> window. This can cause problems if an operator depends on the closing file 
> also to happen in an idempotent manner.
> Improve the operator to save the closing and opening of files in the 
> idempotent state as well so that it can also happen in an idempotent manner.





[jira] [Commented] (APEXMALHAR-2220) Move the FunctionOperator to Malhar library

2016-10-04 Thread Dongming Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545968#comment-15545968
 ] 

Dongming Liang commented on APEXMALHAR-2220:


[~hsy541] There seems to be a circular dependency problem if we move them to 
org.apache.apex.malhar.lib: malhar-library would then depend on malhar-stream, 
while at the same time malhar-stream depends on malhar-library.

How about we merge them under org.apache.apex.malhar.lib in malhar-stream?

> Move the FunctionOperator to Malhar library
> ---
>
> Key: APEXMALHAR-2220
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2220
> Project: Apache Apex Malhar
>  Issue Type: Improvement
>Reporter: Siyuan Hua
>Assignee: Dongming Liang
>
> FunctionOperator was initially designed just for the high-level API, but we 
> think it can also be useful for people who want to build stateless 
> transformations and work with other operators directly. FunctionOperator can 
> be reused. Thus we should move FunctionOperator to the Malhar library.





[jira] [Reopened] (APEXMALHAR-2276) ManagedState: value of a key does not get over-written in the same time bucket

2016-10-04 Thread Chandni Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh reopened APEXMALHAR-2276:
---

> ManagedState: value of a key does not get over-written in the same time bucket
> --
>
> Key: APEXMALHAR-2276
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2276
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Siyuan Hua
>Assignee: Chandni Singh
> Fix For: 3.6.0
>
>
> For example:
> ManagedTimeUnifiedStateImpl mtus;
> mtus.put(1, key1, val1)
> mtus.put(1, key1, val2)
> mtus.get(1, key1).equals(val2) will return false





Re: [VOTE] Hadoop upgrade

2016-10-04 Thread Munagala Ramanath
+1 for 2.6.x

Ram

On Mon, Oct 3, 2016 at 1:47 PM, David Yan  wrote:

> Hi all,
>
> Thomas created this ticket for upgrading our Hadoop dependency version a
> couple weeks ago:
>
> https://issues.apache.org/jira/browse/APEXCORE-536
>
> We'd like to get the ball rolling and would like to take a vote from the
> community which version we would like to upgrade to. We have these choices:
>
> 2.2.0 (no upgrade)
> 2.4.x
> 2.5.x
> 2.6.x
>
> We are not considering 2.7.x because we already know that many Apex users
> are using Hadoop distros that are based on 2.6.
>
> Please note that Apex works with all versions of Hadoop higher or equal to
> the Hadoop version Apex depends on, as long as it's 2.x.x. We are not
> considering Hadoop 3.0.0-alpha yet at this time.
>
> When voting, please keep these in mind:
>
> - The features that are added in 2.4.x, 2.5.x, and 2.6.x respectively, and
> how useful those features are for Apache Apex
> - The Hadoop versions the major distros (Cloudera, Hortonworks, MapR, EMR,
> etc) are supporting
> - The Hadoop versions what typical Apex users are using
>
> Thanks,
>
> David
>


Re: [VOTE] Hadoop upgrade

2016-10-04 Thread Vlad Rozov

+1 for 2.6.x

Vlad

On 10/3/16 13:47, David Yan wrote:

Hi all,

Thomas created this ticket for upgrading our Hadoop dependency version a
couple weeks ago:

https://issues.apache.org/jira/browse/APEXCORE-536

We'd like to get the ball rolling and would like to take a vote from the
community which version we would like to upgrade to. We have these choices:

2.2.0 (no upgrade)
2.4.x
2.5.x
2.6.x

We are not considering 2.7.x because we already know that many Apex users
are using Hadoop distros that are based on 2.6.

Please note that Apex works with all versions of Hadoop higher or equal to
the Hadoop version Apex depends on, as long as it's 2.x.x. We are not
considering Hadoop 3.0.0-alpha yet at this time.

When voting, please keep these in mind:

- The features that are added in 2.4.x, 2.5.x, and 2.6.x respectively, and
how useful those features are for Apache Apex
- The Hadoop versions the major distros (Cloudera, Hortonworks, MapR, EMR,
etc) are supporting
- The Hadoop versions what typical Apex users are using

Thanks,

David





Re: Writing to external systems in reconciled fashion

2016-10-04 Thread Tushar Gosavi
Hi Priyanka,

Tuples stored in an HFile will not be replayed at the output in the same order, 
as an HFile stores tuples in a different order. If order is important, you 
could use org.apache.apex.malhar.lib.wal.FileSystemWAL, which is like an 
on-disk queue. Or you could directly use the more high-level 
SpillableArrayListImpl.

- Tushar.
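The on-disk-queue idea mentioned above can be sketched as a minimal append-only file in the spirit of FileSystemWAL (this is a Python illustration, not the real FileSystemWAL API): appends preserve arrival order, unlike an HFile, which sorts entries by key.

```python
# Minimal on-disk FIFO sketch in the spirit of FileSystemWAL (not its
# real API): appends preserve order; a read offset tracks consumption.
import os, tempfile

class DiskQueue:
    def __init__(self, path):
        self.path = path
        self.read_offset = 0
        open(path, "ab").close()  # create the file if missing

    def append(self, line):
        with open(self.path, "a") as f:
            f.write(line + "\n")

    def take(self):
        with open(self.path) as f:
            f.seek(self.read_offset)
            line = f.readline()
            self.read_offset = f.tell()
        return line.rstrip("\n") or None  # None when the queue is drained

q = DiskQueue(os.path.join(tempfile.mkdtemp(), "wal"))
for t in ["t1", "t2", "t3"]:
    q.append(t)
print([q.take(), q.take(), q.take()])  # ['t1', 't2', 't3'] -- order preserved
```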

On Tue, Oct 4, 2016 at 3:48 PM, Priyanka Gugale
 wrote:
> I am specifically looking into HFiles when I mentioned Hbase.
>
> -Priyanka
>
> On Tue, Oct 4, 2016 at 3:47 PM, Priyanka Gugale 
> wrote:
>
>> I am reconsidering using WindowDataManager or Hbase instead of
>> ManagedState to implement SpillableQueue.
>> Let me know if anyone have any thoughts on same.
>>
>> -Priyanka
>>
>> On Thu, Sep 29, 2016 at 5:54 PM, Priyanka Gugale > > wrote:
>>
>>> Hi,
>>>
>>> Please refer to this link for earlier discussion reference:
>>> https://lists.apache.org/list.html?dev@apex.apache.org
>>>
>>> -Priyanka
>>>
>>> On Thu, Sep 29, 2016 at 5:46 PM, Priyanka Gugale <
>>> priya...@datatorrent.com> wrote:
>>>
 Dear community,

 We had a discussion
 
 before about having reconciler for writing to JDBC output operator. I am
 proposing to write  a reconciler plugin which should be generic enough to
 work with most of output Systems. Please refer this
 
 document for design details. We also have a jira
  for same.

 Please provide your feedback.

 -Priyanka


>>>
>>


Re: [VOTE] Hadoop upgrade

2016-10-04 Thread Tushar Gosavi
+1 for 2.6

- Tushar.


On Tue, Oct 4, 2016 at 3:18 PM, Pradeep A. Dalvi  wrote:
> +1 for 2.6.x
>
> On Tuesday, October 4, 2016, David Yan  wrote:
>
>> Hi all,
>>
>> Thomas created this ticket for upgrading our Hadoop dependency version a
>> couple weeks ago:
>>
>> https://issues.apache.org/jira/browse/APEXCORE-536
>>
>> We'd like to get the ball rolling and would like to take a vote from the
>> community which version we would like to upgrade to. We have these choices:
>>
>> 2.2.0 (no upgrade)
>> 2.4.x
>> 2.5.x
>> 2.6.x
>>
>> We are not considering 2.7.x because we already know that many Apex users
>> are using Hadoop distros that are based on 2.6.
>>
>> Please note that Apex works with all versions of Hadoop higher or equal to
>> the Hadoop version Apex depends on, as long as it's 2.x.x. We are not
>> considering Hadoop 3.0.0-alpha yet at this time.
>>
>> When voting, please keep these in mind:
>>
>> - The features that are added in 2.4.x, 2.5.x, and 2.6.x respectively, and
>> how useful those features are for Apache Apex
>> - The Hadoop versions the major distros (Cloudera, Hortonworks, MapR, EMR,
>> etc) are supporting
>> - The Hadoop versions what typical Apex users are using
>>
>> Thanks,
>>
>> David
>>


Re: [VOTE] Hadoop upgrade

2016-10-04 Thread Pradeep A. Dalvi
+1 for 2.6.x

On Tuesday, October 4, 2016, David Yan  wrote:

> Hi all,
>
> Thomas created this ticket for upgrading our Hadoop dependency version a
> couple weeks ago:
>
> https://issues.apache.org/jira/browse/APEXCORE-536
>
> We'd like to get the ball rolling and would like to take a vote from the
> community which version we would like to upgrade to. We have these choices:
>
> 2.2.0 (no upgrade)
> 2.4.x
> 2.5.x
> 2.6.x
>
> We are not considering 2.7.x because we already know that many Apex users
> are using Hadoop distros that are based on 2.6.
>
> Please note that Apex works with all versions of Hadoop higher or equal to
> the Hadoop version Apex depends on, as long as it's 2.x.x. We are not
> considering Hadoop 3.0.0-alpha yet at this time.
>
> When voting, please keep these in mind:
>
> - The features that are added in 2.4.x, 2.5.x, and 2.6.x respectively, and
> how useful those features are for Apache Apex
> - The Hadoop versions the major distros (Cloudera, Hortonworks, MapR, EMR,
> etc) are supporting
> - The Hadoop versions what typical Apex users are using
>
> Thanks,
>
> David
>


Re: Generic Malhar operator to get formatted data.

2016-10-04 Thread Deepak Narkhede
Hi Priyanka,
Yes, it is a template engine. The initial use case to target is generating 
automated emails from templates for alerts; logging for monitoring may follow 
as well.
Another use case is converting POJOs, Maps, or Lists into XML or JSON format, 
depending on the template, for other operators to consume.

Thanks,
Deepak
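The operator's contract described above can be sketched with Python's string.Template standing in for FreeMarker (FreeMarker is a Java library; the `render` helper and field names here are invented for illustration): a Map plus a template string yields a formatted record, whether an alert email body or an XML/JSON fragment.

```python
# Sketch of the proposed operator's contract, with string.Template as a
# stand-in for FreeMarker (a Java library): Map + template -> string.
from string import Template

def render(template_text, fields):
    # safe_substitute leaves unknown placeholders intact instead of raising
    return Template(template_text).safe_substitute(fields)

alert = render(
    "ALERT: node $host used $mem MB at $time",
    {"host": "n1", "mem": 4096, "time": "10:05"})
xml = render("<user><name>$name</name></user>", {"name": "deepak"})
print(alert)  # ALERT: node n1 used 4096 MB at 10:05
print(xml)    # <user><name>deepak</name></user>
```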





On Tue, Oct 4, 2016 at 1:46 PM, Priyanka Gugale 
wrote:

> As far as I know "freemarker" is a template engine. How does it help us to
> achieve our use case? For example, how does it help in parsing a Map or POJO?
> Also, how does it help in generating output in forms other than HTML?
>
> -Priyanka
>
> On Tue, Oct 4, 2016 at 12:54 PM, Deepak Narkhede 
> wrote:
>
> > Hi Folks,
> >
> > Planning to write an malhar operator which will take Map or POJO as input
> > and provide output as  formatted data string as per specified template
> > data.
> >
> > Use Cases:
> > Get data in output data format like XML, HTML etc.
> > Generate automated emails etc.
> > Generate configuration files also source code in some cases.
> > Generate Date/time format etc.
> >
> > How:
> > Planning to use template engine library "Freemarker" currently under
> Apache
> > License.
> > Investigated  libraries like freemarker, thymeleaf and velocity.
> >
> > Strengths of Freemarker with respect to the others mentioned above:
> > Ease of use, almost zero dependencies, light weight (faster processing)
> >
> > Please let me know your thoughts/suggestions on this.
> >
> > Thanks,
> > Deepak
> >
> >
> >
> >
> >
> >
> >
> > --
> > Thanks & Regards
> >
> > Deepak Narkhede
> >
>



-- 
Thanks & Regards

Deepak Narkhede


Generic Malhar operator to get formatted data.

2016-10-04 Thread Deepak Narkhede
Hi Folks,

Planning to write a Malhar operator that will take a Map or POJO as input and 
produce a formatted data string as output, as per a specified template.

Use Cases:
Get data in output data format like XML, HTML etc.
Generate automated emails etc.
Generate configuration files also source code in some cases.
Generate Date/time format etc.

How:
Planning to use the template engine library "Freemarker", currently under the 
Apache License.
Investigated libraries like FreeMarker, Thymeleaf, and Velocity.

Strengths of Freemarker with respect to the others mentioned above:
ease of use, almost zero dependencies, lightweight (faster processing)

Please let me know your thoughts/suggestions on this.

Thanks,
Deepak







-- 
Thanks & Regards

Deepak Narkhede


Re: Fixed Width Record Parser

2016-10-04 Thread Hitesh Kapoor
Hi All,

Thank you for the feedback.
I will use univocity for parsing only, and will do the type checking/validation 
manually.
The input schema is similar to that of CSV, so I will have to create another 
base class, Schema (holding the common elements of the delimited and 
fixed-width schemas), from which the Delimited and FixedWidth schemas will 
inherit.
I will use PojoUtils for constructing the POJO.

Regards,
Hitesh
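The plan above can be sketched as follows (a Python illustration with an assumed schema shape; the real operator would use univocity and PojoUtils in Java): slice fields by start/end position, then validate types manually, including the strict date check that motivated this thread (e.g. rejecting month 13).

```python
# Sketch of the plan above (assumed schema shape, not the real parser):
# slice fields by start/end position, then validate types manually --
# including a strict date check that rejects impossible dates.
from datetime import datetime

schema = [  # (name, start, end, type) -- hypothetical example schema
    ("id",   0,  4, "int"),
    ("date", 4, 14, "date"),  # dd/mm/yyyy
]

def parse_fixed_width(line):
    record = {}
    for name, start, end, ftype in schema:
        raw = line[start:end].strip()
        if ftype == "int":
            record[name] = int(raw)
        elif ftype == "date":
            # strptime raises ValueError for impossible dates such as month 13
            record[name] = datetime.strptime(raw, "%d/%m/%Y").date()
    return record

print(parse_fixed_width("004212/10/2000"))  # {'id': 42, 'date': datetime.date(2000, 10, 12)}
```

A line like "004212/13/2000" would raise ValueError here, whereas a lenient parser might silently accept the out-of-range month.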


On Tue, Oct 4, 2016 at 11:59 AM, Shubham Pathak 
wrote:

> Hi Hitesh,
>
> I agree with Chinmay. -1 for creating our own library.
>
> +1 for using Univocity.
> For input schema, I suggest we use the same one as used by Delimited
> Parser. We would need to add fields to accept padding character,
> startingCharacterPosition and endingCharacterPosition.
>
> To construct the POJOs you may use PojoUtils
>  library/src/main/java/com/datatorrent/lib/util/PojoUtils.java>
>
> Thanks,
> Shubham
>
>
>
>
>
> On Mon, Oct 3, 2016 at 4:09 AM, Chinmay Kolhatkar  >
> wrote:
>
> > Hi Hitesh,
> >
> > In general I'm not in favor of reinventing the wheel. For one, it takes
> > effort to maintain the library; secondly, a self-written library might
> > take a longer time to mature and become stable for production use.
> >
> > Hence, -1 from me for creating own library for fixed length parsing.
> >
> > I saw the libraries that you proposed and want to add one more library to
> > the list - jFFP (http://jffp.sourceforge.net/).
> >
> > To me jFFP and univocity looks good options. I'm personally more inclined
> > towards univocity because it seems to be active in development (last
> commit
> > 4 days ago) and secondly this library has been used in Fixed Length File
> > Loader for Enrichment.
> >
> > My overall vote is to use univocity as much as possible and if there is
> any
> > missing (& important to us) feature in univocity, that should be added
> over
> > top in our operator.
> >
> > Thanks,
> > Chinmay.
> >
> >
> > On Mon, Oct 3, 2016 at 2:12 PM, Hitesh Kapoor 
> > wrote:
> >
> > > Hi All,
> > >
> > > Thank you for your feedback.
> > > So as per the votes/comments, I will not be going ahead with approach 2
> > as
> > > it is not clean.
> > >
> > > For approach 1, I have looked at the possibility to use existing
> parsing
> > > libraries like flatworm, flatpack, univocity,
> > > following are the problems with using existing libraries:
> > > 1) These libraries take input schema in a specific format and are
> > > complicated to use.
> > > For example the most famous library (as per stackoverflow) flatworm
> will
> > > involve giving the input schema in Xml format (refer
> > > http://flatworm.sourceforge.net/) so we will lose our consistency
> with
> > > existing parsers like CsvParser, where we take i/p in JSON format. Not
> > only
> > > the consistency it will be more difficult for the user to give input in
> > > flatworm specific XML.
> > > If we decide to convert our JSON to Flatworm specific Xml, it will
> > involve
> > > a lot more work than writing our own library.
> > > 2) They do only limited type checking: for example, for a Date type that
> > > adheres to dd/mm/yyyy, a date like 12/13/2000 may parse correctly even
> > > though the month is beyond 12.
> > > 3) Difficult to handle Boolean and Date datatypes.
> > > 4) Future scalability may take a hit. For example if we want to add
> more
> > > constraints to our parser like min value for an integer or a pattern
> for
> > a
> > > string , it won't be possible to do it with existing libraries.
> > > 5) To retrieve the values to create a POJO is not user (coder)
> friendly.
> > >
> > > According to me we should write our own library to do the parsing and
> > > validation  as to use an existing library will involve more work.
> > > The work involved in coding the library is easy and straightforward.
> > > It will be easier for us to scale and also provide an easy life for the
> > end
> > > user to provide the input schema.
> > > The reason we are not going ahead with approach 2 is that it is not
> > clean,
> > > the twisting and turning involved in using (forcefully using) existing
> > > libraries appears more dirty to me.
> > >
> > > Regards,
> > > Hitesh
> > >
> > >
> > >
> > > On Thu, Sep 8, 2016 at 1:37 PM, Yogi Devendra <
> > > devendra.vyavah...@gmail.com>
> > > wrote:
> > >
> > > > If we specify order of the fields and length for each field then
> start,
> > > end
> > > > can be computed.
> > > > Why do we need end user to specify start position for each field?
> > > >
> > > > ~ Yogi
> > > >
> > > > On 8 September 2016 at 12:48, Chinmay Kolhatkar <
> > chin...@datatorrent.com
> > > >
> > > > wrote:
> > > >
> > > > > Few points/questions:
> > > > > 1. Agree with Yogi. Approach 2 does not look clean.
> > > > > 2. Do we need "recordwidthlength"?
> > > > > 3. "recordseperator" should be "\n" and not "/n".
> > > > > 4. In general, providing schema as a 

Re: Fixed Width Record Parser

2016-10-04 Thread Shubham Pathak
Hi Hitesh,

I agree with Chinmay. -1 for creating our own library.

+1 for using Univocity.
For input schema, I suggest we use the same one as used by Delimited
Parser. We would need to add fields to accept padding character,
startingCharacterPosition and endingCharacterPosition.

To construct the POJOs you may use PojoUtils


Thanks,
Shubham





On Mon, Oct 3, 2016 at 4:09 AM, Chinmay Kolhatkar 
wrote:

> Hi Hitesh,
>
> In general I'm not in favor of reinventing the wheel. For one, it takes
> effort to maintain the library; secondly, a self-written library might take
> a longer time to mature and become stable for production use.
>
> Hence, -1 from me for creating own library for fixed length parsing.
>
> I saw the libraries that you proposed and want to add one more library to
> the list - jFFP (http://jffp.sourceforge.net/).
>
> To me jFFP and univocity look like good options. I'm personally more inclined
> towards univocity because it seems to be active in development (last commit
> 4 days ago), and secondly this library has already been used in the Fixed
> Length File Loader for Enrichment.
>
> My overall vote is to use univocity as much as possible, and if there is any
> missing (& important to us) feature in univocity, it should be added on
> top in our operator.
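Whichever library is chosen, the core of fixed-width parsing is slicing each record at precomputed offsets and stripping the padding. A minimal stdlib-only Java sketch of that core (the field widths and pad character below are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;

public class FixedWidthDemo {
  // Slice one record into fields by width, stripping the pad character
  // from both ends of each raw slice.
  static List<String> parseRecord(String line, int[] widths, char pad) {
    List<String> out = new ArrayList<>();
    int start = 0;
    for (int w : widths) {
      int end = Math.min(start + w, line.length());
      String raw = line.substring(start, end);
      int b = 0, e = raw.length();
      while (e > b && raw.charAt(e - 1) == pad) e--;  // trailing padding
      while (b < e && raw.charAt(b) == pad) b++;      // leading padding
      out.add(raw.substring(b, e));
      start = end;
    }
    return out;
  }

  public static void main(String[] args) {
    // Two fields: 10 chars then 3 chars, space-padded.
    System.out.println(parseRecord("John Smith 25", new int[]{10, 3}, ' '));
  }
}
```

A library like univocity layers type conversion, error handling, and configurable padding per field on top of this slicing step.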
>
> Thanks,
> Chinmay.
>
>
> On Mon, Oct 3, 2016 at 2:12 PM, Hitesh Kapoor 
> wrote:
>
> > Hi All,
> >
> > Thank you for your feedback.
> > So as per the votes/comments, I will not be going ahead with approach 2
> as
> > it is not clean.
> >
> > For approach 1, I have looked at the possibility of using existing parsing
> > libraries like flatworm, flatpack, and univocity;
> > following are the problems with using existing libraries:
> > 1) These libraries take the input schema in a specific format and are
> > complicated to use.
> > For example, the most popular library (as per Stack Overflow), flatworm,
> > involves giving the input schema in XML format (refer
> > http://flatworm.sourceforge.net/), so we would lose our consistency with
> > existing parsers like CsvParser, where we take input in JSON format. Besides
> > the consistency, it will be more difficult for the user to give input in
> > flatworm-specific XML.
> > If we decide to convert our JSON to flatworm-specific XML, it will involve
> > a lot more work than writing our own library.
> > 2) They do only limited type checking. For example, for a Date field that
> > should adhere to dd/mm/yyyy, the input 12/13/2000 may parse correctly even
> > though the month is beyond 12.
> > 3) Difficult to handle Boolean and Date datatypes.
> > 4) Future extensibility may take a hit. For example, if we want to add more
> > constraints to our parser, like a minimum value for an integer or a pattern
> > for a string, it won't be possible to do so with existing libraries.
> > 5) Retrieving the values to create a POJO is not user (coder) friendly.
> >
> > In my view, we should write our own library to do the parsing and
> > validation, as using an existing library will involve more work.
> > The work involved in coding the library is easy and straightforward.
> > It will be easier for us to extend, and it also makes it easy for the end
> > user to provide the input schema.
> > The reason we are not going ahead with approach 2 is that it is not clean;
> > the twisting and turning involved in (forcefully) using existing
> > libraries looks too hacky to me.
> >
> > Regards,
> > Hitesh
> >
> >
> >
> > On Thu, Sep 8, 2016 at 1:37 PM, Yogi Devendra <
> > devendra.vyavah...@gmail.com>
> > wrote:
> >
> > > If we specify the order of the fields and the length of each field, then
> > > the start and end positions can be computed.
> > > Why do we need the end user to specify the start position for each field?
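The derivation Yogi describes, computing each field's start and end offsets from the declared order and lengths, is a single prefix-sum pass. A small illustrative Java sketch (the field widths are hypothetical):

```java
import java.util.Arrays;

public class OffsetsDemo {
  public static void main(String[] args) {
    int[] lengths = {10, 3, 8};          // field widths, in declared order
    int[] start = new int[lengths.length];
    int[] end = new int[lengths.length];
    int pos = 0;
    for (int i = 0; i < lengths.length; i++) {
      start[i] = pos;                    // inclusive start offset
      pos += lengths[i];
      end[i] = pos;                      // exclusive end offset
    }
    System.out.println(Arrays.toString(start));  // [0, 10, 13]
    System.out.println(Arrays.toString(end));    // [10, 13, 21]
  }
}
```

This is why the schema only needs ordered (name, length) pairs; explicit start/end positions would be redundant and a source of inconsistency.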
> > >
> > > ~ Yogi
> > >
> > > On 8 September 2016 at 12:48, Chinmay Kolhatkar <
> chin...@datatorrent.com
> > >
> > > wrote:
> > >
> > > > Few points/questions:
> > > > 1. Agree with Yogi. Approach 2 does not look clean.
> > > > 2. Do we need "recordwidthlength"?
> > > > 3. "recordseperator" should be "\n" and not "/n".
> > > > 4. In general, providing the schema as JSON is tedious from the user's
> > > > perspective.
> > > > I suggest we find a simpler format for specifying schema. For eg.
> > > > ,,,
> > > > 5. I suggest we first provide a basic parser to Malhar which does only
> > > > parsing and type checking. Constraints, IMO, are not part of the parsing
> > > > module, or if needed can be added as a phase-2 improvement of this parser.
> > > > 6. I would suggest using some existing library for parsing. There is no
> > > > point in re-inventing the wheel, and trying to make something robust can
> > > > be time-consuming.
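The exact compact schema format Chinmay suggested in point 4 was lost in archiving (the placeholder text was stripped). Assuming a hypothetical form of comma-separated name:length pairs, parsing such a string is straightforward; a stdlib-only Java sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaStringDemo {
  // Parse a hypothetical compact schema "name:length,name:length,..."
  // into an ordered field-name -> width map (a stand-in for a richer
  // field specification with types and constraints).
  static Map<String, Integer> parseSchema(String schema) {
    Map<String, Integer> fields = new LinkedHashMap<>();
    for (String part : schema.split(",")) {
      String[] kv = part.split(":");
      fields.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
    }
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(parseSchema("name:10,age:3,dept:8"));
  }
}
```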
> > > >
> > > > -Chinmay.
> > > >
> > > >
> > > > On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> > > > devendra.vyavah...@gmail.com>
> > > > wrote:
> > > >
> > > 

Re: [VOTE] Hadoop upgrade

2016-10-04 Thread Chinmay Kolhatkar
+1 for 2.6

On Tue, Oct 4, 2016 at 6:19 AM, Bhupesh Chawda 
wrote:

> +1 for 2.6
>
> ~ Bhupesh
>
> On Oct 4, 2016 4:14 AM, "Siyuan Hua"  wrote:
>
> > +1 for 2.6.x
> >
> > On Mon, Oct 3, 2016 at 3:41 PM, Pramod Immaneni 
> > wrote:
> >
> > > +1 for 2.6.x
> > >
> > > On Mon, Oct 3, 2016 at 1:47 PM, David Yan 
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thomas created this ticket for upgrading our Hadoop dependency version
> > > > a couple of weeks ago:
> > > >
> > > > https://issues.apache.org/jira/browse/APEXCORE-536
> > > >
> > > > We'd like to get the ball rolling and would like to take a vote from the
> > > > community on which version we would like to upgrade to. We have these
> > > > choices:
> > > >
> > > > 2.2.0 (no upgrade)
> > > > 2.4.x
> > > > 2.5.x
> > > > 2.6.x
> > > >
> > > > We are not considering 2.7.x because we already know that many Apex
> > users
> > > > are using Hadoop distros that are based on 2.6.
> > > >
> > > > Please note that Apex works with all versions of Hadoop higher than or
> > > > equal to the Hadoop version Apex depends on, as long as it's 2.x.x. We
> > > > are not considering Hadoop 3.0.0-alpha at this time.
> > > >
> > > > When voting, please keep these in mind:
> > > >
> > > > - The features that are added in 2.4.x, 2.5.x, and 2.6.x
> respectively,
> > > and
> > > > how useful those features are for Apache Apex
> > > > - The Hadoop versions the major distros (Cloudera, Hortonworks, MapR,
> > > EMR,
> > > > etc) are supporting
> > > > - The Hadoop versions that typical Apex users are using
> > > >
> > > > Thanks,
> > > >
> > > > David
> > > >
> > >
> >
>