[jira] [Updated] (APEXMALHAR-2332) StateTracker should make sure all the memory freed before remove the bucket from bucketHeap and bucketAccessTimes

2016-11-08 Thread bright chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bright chen updated APEXMALHAR-2332:

Summary: StateTracker should make sure all the memory freed before remove 
the bucket from bucketHeap and bucketAccessTimes  (was: StateTracker should 
free memory after committed)

> StateTracker should make sure all the memory freed before remove the bucket 
> from bucketHeap and bucketAccessTimes
> -
>
> Key: APEXMALHAR-2332
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2332
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: bright chen
>Assignee: bright chen
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Current StateTracker free memory was triggered by a timer. The default the 
> timer value was DAGContext.STREAMING_WINDOW_SIZE_MILLIS.defaultValue * 
> OperatorContext.APPLICATION_WINDOW_COUNT.defaultValue. It would have memory 
> leak if the process with operator thread and memory release thread as 
> following:
> bucket1: put(), put() ... put()
> bucket2: put(), put() ... put()
> freeMemory(): {bucket removed from bucketHeap and bucketAccessTimes}
> commit: bucket1, bucket2
> in this case,  nothing was freed and the bucket can't be freed any more
> And the default value of free memory could large and memory used up even 
> before get chance of free memory. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (APEXMALHAR-2332) StateTracker should free memory after committed

2016-11-08 Thread bright chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bright chen updated APEXMALHAR-2332:

Remaining Estimate: 120h  (was: 48h)
 Original Estimate: 120h  (was: 48h)
   Description: 
Current StateTracker free memory was triggered by a timer. The default the 
timer value was DAGContext.STREAMING_WINDOW_SIZE_MILLIS.defaultValue * 
OperatorContext.APPLICATION_WINDOW_COUNT.defaultValue. It would have memory 
leak if the process with operator thread and memory release thread as following:

bucket1: put(), put() ... put()
bucket2: put(), put() ... put()
freeMemory(): {bucket removed from bucketHeap and bucketAccessTimes}
commit: bucket1, bucket2

in this case,  nothing was freed and the bucket can't be freed any more

And the default value of free memory could large and memory used up even before 
get chance of free memory. 

  was:
StateTracker#bucketAccessTimes keep the bucket access time order by access 
time. It was used by free memory thread to decide which bucket can be freed. 
So, each access to bucket include put and get should update the access time.

As bucketAccessTimes and bucketHeap are shared by two thread. update them for 
each operation could impact the performance. It better to update period. As 
Bucket don't support window operation. I am going to keep the update time and 
update when time out.


   Summary: StateTracker should free memory after committed  (was: 
StateTracker#bucketAccessed should be called each time access to the bucket)

> StateTracker should free memory after committed
> ---
>
> Key: APEXMALHAR-2332
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2332
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: bright chen
>Assignee: bright chen
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Current StateTracker free memory was triggered by a timer. The default the 
> timer value was DAGContext.STREAMING_WINDOW_SIZE_MILLIS.defaultValue * 
> OperatorContext.APPLICATION_WINDOW_COUNT.defaultValue. It would have memory 
> leak if the process with operator thread and memory release thread as 
> following:
> bucket1: put(), put() ... put()
> bucket2: put(), put() ... put()
> freeMemory(): {bucket removed from bucketHeap and bucketAccessTimes}
> commit: bucket1, bucket2
> in this case,  nothing was freed and the bucket can't be freed any more
> And the default value of free memory could large and memory used up even 
> before get chance of free memory. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXCORE-570) Prevent upstream operators from getting too far ahead when downstream operators are slow

2016-11-08 Thread Pramod Immaneni (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXCORE-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649116#comment-15649116
 ] 

Pramod Immaneni commented on APEXCORE-570:
--

Also will look at how disabling buffer spooling can cause deadlocks.

> Prevent upstream operators from getting too far ahead when downstream 
> operators are slow
> 
>
> Key: APEXCORE-570
> URL: https://issues.apache.org/jira/browse/APEXCORE-570
> Project: Apache Apex Core
>  Issue Type: Improvement
>Reporter: Pramod Immaneni
>Assignee: Pramod Immaneni
>
> If the downstream operators are slower than upstream operators then the 
> upstream operators will get ahead and the gap can continue to increase. 
> Provide an option to slow down or temporarily pause the upstream operators 
> when they get too far ahead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (APEXMALHAR-2332) StateTracker#bucketAccessed should be called each time access to the bucket

2016-11-08 Thread bright chen (JIRA)
bright chen created APEXMALHAR-2332:
---

 Summary: StateTracker#bucketAccessed should be called each time 
access to the bucket
 Key: APEXMALHAR-2332
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2332
 Project: Apache Apex Malhar
  Issue Type: Bug
Reporter: bright chen
Assignee: bright chen


StateTracker#bucketAccessTimes keep the bucket access time order by access 
time. It was used by free memory thread to decide which bucket can be freed. 
So, each access to bucket include put and get should update the access time.

As bucketAccessTimes and bucketHeap are shared by two thread. update them for 
each operation could impact the performance. It better to update period. As 
Bucket don't support window operation. I am going to keep the update time and 
update when time out.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXMALHAR-2331) StateTracker#bucketAccessed should add bucket to bucketAccessTimes

2016-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648451#comment-15648451
 ] 

ASF GitHub Bot commented on APEXMALHAR-2331:


GitHub user brightchen opened a pull request:

https://github.com/apache/apex-malhar/pull/488

APEXMALHAR-2331 #resolve #comment StateTracker#bucketAccessed should …

…add bucket to bucketAccessTimes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/brightchen/apex-malhar APEXMALHAR-2331

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/488.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #488


commit b1c829afb46fc6d505c58a4938901061e3fb38fc
Author: brightchen 
Date:   2016-11-08T18:53:43Z

APEXMALHAR-2331 #resolve #comment StateTracker#bucketAccessed should add 
bucket to bucketAccessTimes




> StateTracker#bucketAccessed should add bucket to bucketAccessTimes
> --
>
> Key: APEXMALHAR-2331
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2331
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: bright chen
>Assignee: bright chen
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The bucket didn't add to the bucketAccessTimes, which cause lots of 
> BucketIdTimeWrapper instances created and added to bucketHeap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] apex-malhar pull request #488: APEXMALHAR-2331 #resolve #comment StateTracke...

2016-11-08 Thread brightchen
GitHub user brightchen opened a pull request:

https://github.com/apache/apex-malhar/pull/488

APEXMALHAR-2331 #resolve #comment StateTracker#bucketAccessed should …

…add bucket to bucketAccessTimes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/brightchen/apex-malhar APEXMALHAR-2331

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/488.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #488


commit b1c829afb46fc6d505c58a4938901061e3fb38fc
Author: brightchen 
Date:   2016-11-08T18:53:43Z

APEXMALHAR-2331 #resolve #comment StateTracker#bucketAccessed should add 
bucket to bucketAccessTimes




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Assigned] (APEXMALHAR-2331) StateTracker#bucketAccessed should add bucket to bucketAccessTimes

2016-11-08 Thread bright chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/APEXMALHAR-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bright chen reassigned APEXMALHAR-2331:
---

Assignee: bright chen

> StateTracker#bucketAccessed should add bucket to bucketAccessTimes
> --
>
> Key: APEXMALHAR-2331
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2331
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: bright chen
>Assignee: bright chen
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The bucket didn't add to the bucketAccessTimes, which cause lots of 
> BucketIdTimeWrapper instances created and added to bucketHeap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (APEXMALHAR-2331) StateTracker#bucketAccessed should add bucket to bucketAccessTimes

2016-11-08 Thread bright chen (JIRA)
bright chen created APEXMALHAR-2331:
---

 Summary: StateTracker#bucketAccessed should add bucket to 
bucketAccessTimes
 Key: APEXMALHAR-2331
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2331
 Project: Apache Apex Malhar
  Issue Type: Bug
Reporter: bright chen


The bucket didn't add to the bucketAccessTimes, which cause lots of 
BucketIdTimeWrapper instances created and added to bucketHeap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-11-08 Thread Munagala V. Ramanath (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647751#comment-15647751
 ] 

Munagala V. Ramanath commented on APEXMALHAR-2274:
--

The benefit of such an interface is not clear at this point; perhaps when the 
need for polymorphic use of such an interface arises, we can refactor.


> AbstractFileInputOperator gets killed when there are a large number of files.
> -
>
> Key: APEXMALHAR-2274
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2274
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Munagala V. Ramanath
>Assignee: Matt Zhang
>
> When there are a large number of files in the monitored directory, the call 
> to DirectoryScanner.scan() can take a long time since it calls 
> FileSystem.listStatus() which returns the entire list. Meanwhile, the 
> AppMaster deems this operator hung and restarts it which again results in the 
> same problem.
> It should use FileSystem.listStatusIterator() [in Hadoop 2.7.X] or 
> FileSystem.listFiles() [in 2.6.X] or other similar calls that return
> a remote iterator to limit the number files processed in a single call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXMALHAR-2278) Implement Kudu Output Operator for non-transactional streams

2016-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647571#comment-15647571
 ] 

ASF GitHub Bot commented on APEXMALHAR-2278:


GitHub user ananthc opened a pull request:

https://github.com/apache/apex-malhar/pull/486

APEXMALHAR-2278 Implement KuduNonTransactional Output Operator

@tweise  / @PramodSSImmaneni  : Could you please review or suggest the 
right person who can complete this review for me? 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ananthc/apex-malhar 
APEXMALHAR-2278.KuduNonTransactionalOutputOperator

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/486.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #486


commit af6944f53c86d0fb355798b9c48d0156964648c9
Author: ananthc 
Date:   2016-11-08T13:28:28Z

APEXMALHAR-2278 Implement KuduNonTransactional Output Operator




> Implement Kudu Output Operator for non-transactional streams
> 
>
> Key: APEXMALHAR-2278
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2278
> Project: Apache Apex Malhar
>  Issue Type: New Feature
>  Components: adapters database
>Reporter: Ananth
>Assignee: Ananth
>
> Here are some benefits of integrating Kudu and Apex:
> Kudu is just declared 1.0 and has just been declared production ready.
> Kudu as a store might a good a fit for many architectures in the years to 
> come because of its capabilities to provide mutability of data ( unlike HDFS 
> ) and optimized storage formats for low latency scans.
> It seems to also withstand high-throughput write patterns which makes it 
> a stable sink for Apex workflows which operate at very high volumes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] apex-malhar pull request #486: APEXMALHAR-2278 Implement KuduNonTransactiona...

2016-11-08 Thread ananthc
GitHub user ananthc opened a pull request:

https://github.com/apache/apex-malhar/pull/486

APEXMALHAR-2278 Implement KuduNonTransactional Output Operator

@tweise  / @PramodSSImmaneni  : Could you please review or suggest the 
right person who can complete this review for me? 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ananthc/apex-malhar 
APEXMALHAR-2278.KuduNonTransactionalOutputOperator

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/486.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #486


commit af6944f53c86d0fb355798b9c48d0156964648c9
Author: ananthc 
Date:   2016-11-08T13:28:28Z

APEXMALHAR-2278 Implement KuduNonTransactional Output Operator




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (APEXMALHAR-2284) POJOInnerJoinOperatorTest fails in Travis CI

2016-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647500#comment-15647500
 ] 

ASF GitHub Bot commented on APEXMALHAR-2284:


GitHub user chaithu14 opened a pull request:

https://github.com/apache/apex-malhar/pull/485

APEXMALHAR-2284 *Review Only* 

1) Implemented SpillableMap, SpillableArrayList, SpillableArrayListMultiMap 
over TimeSlicedBucketState 2) Integrated Spillable data structure into Inner 
Join operator

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chaithu14/incubator-apex-malhar 
APEXMALHAR-2284-SPDOverTime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/485.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #485


commit 991e5660f52c266e75440fc0b7c03aa3eb4c05fe
Author: chaitanya 
Date:   2016-11-08T12:57:16Z

APEXMALHAR-2284 *Review Only* 1) Implemented SpillableMap, 
SpillableArrayList, SpillableArrayListMultiMap over TimeSlicedBucketState 2) 
Integrated Spillable data structure into Inner Join opeator




> POJOInnerJoinOperatorTest fails in Travis CI
> 
>
> Key: APEXMALHAR-2284
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2284
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Thomas Weise
>Assignee: Chaitanya
>Priority: Blocker
> Fix For: 3.6.0
>
>
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/166322754/log.txt
> {code}
> Failed tests: 
>   POJOInnerJoinOperatorTest.testEmitMultipleTuplesFromStream2:337 Number of 
> tuple emitted  expected:<2> but was:<4>
>   POJOInnerJoinOperatorTest.testInnerJoinOperator:184 Number of tuple emitted 
>  expected:<1> but was:<2>
>   POJOInnerJoinOperatorTest.testMultipleValues:236 Number of tuple emitted  
> expected:<2> but was:<3>
>   POJOInnerJoinOperatorTest.testUpdateStream1Values:292 Number of tuple 
> emitted  expected:<1> but was:<2>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] apex-malhar pull request #485: APEXMALHAR-2284 *Review Only*

2016-11-08 Thread chaithu14
GitHub user chaithu14 opened a pull request:

https://github.com/apache/apex-malhar/pull/485

APEXMALHAR-2284 *Review Only* 

1) Implemented SpillableMap, SpillableArrayList, SpillableArrayListMultiMap 
over TimeSlicedBucketState 2) Integrated Spillable data structure into Inner 
Join operator

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chaithu14/incubator-apex-malhar 
APEXMALHAR-2284-SPDOverTime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/485.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #485


commit 991e5660f52c266e75440fc0b7c03aa3eb4c05fe
Author: chaitanya 
Date:   2016-11-08T12:57:16Z

APEXMALHAR-2284 *Review Only* 1) Implemented SpillableMap, 
SpillableArrayList, SpillableArrayListMultiMap over TimeSlicedBucketState 2) 
Integrated Spillable data structure into Inner Join opeator




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (APEXMALHAR-2330) JdbcPOJOPollInputOperator fails with NullPointerException when PostgreSQL driver

2016-11-08 Thread Deepak Narkhede (JIRA)
Deepak Narkhede created APEXMALHAR-2330:
---

 Summary: JdbcPOJOPollInputOperator fails with NullPointerException 
when PostgreSQL driver
 Key: APEXMALHAR-2330
 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2330
 Project: Apache Apex Malhar
  Issue Type: Bug
Reporter: Deepak Narkhede
Assignee: Deepak Narkhede


Here is the description:
Problem: 
JdbcPOJOPollInputOperator fails with NullPointerException when PostgreSQL 
driver.

Problem Description:
1) When JdbcPOJOPollInputOperator tries to populateColumnDataTypes, column 
names retrieved from resultmetadata from database ( this case : Postgres) are 
all lowercase.
2) Whereas columnDatatypes specified in fieldinfos might be in same case.
3) Internally hashmap ( nameToType) is used which mismatches if column name and 
fieldinfo are not in same case. Hence columnDataTypes is empty which causes 
null exception in activate call.

Proposed Solution:
Using similar case for hashmap and column names irrespective of any database 
used for JdbcPOJOPollInputOperator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-11-08 Thread Tushar Gosavi (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647455#comment-15647455
 ] 

Tushar Gosavi commented on APEXMALHAR-2274:
---

we  could derive a common interface which could be used in both the operators 
for scanning files. Both are conceptually doing the same thing. we could have 
different implementaion of the interface, where one could just get the paths 
and other can get path as well as status.

> AbstractFileInputOperator gets killed when there are a large number of files.
> -
>
> Key: APEXMALHAR-2274
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2274
> Project: Apache Apex Malhar
>  Issue Type: Bug
>Reporter: Munagala V. Ramanath
>Assignee: Matt Zhang
>
> When there are a large number of files in the monitored directory, the call 
> to DirectoryScanner.scan() can take a long time since it calls 
> FileSystem.listStatus() which returns the entire list. Meanwhile, the 
> AppMaster deems this operator hung and restarts it which again results in the 
> same problem.
> It should use FileSystem.listStatusIterator() [in Hadoop 2.7.X] or 
> FileSystem.listFiles() [in 2.6.X] or other similar calls that return
> a remote iterator to limit the number files processed in a single call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)