[jira] [Resolved] (BEAM-3655) Port MaxPerKeyExamplesTest off DoFnTester

2018-10-09 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-3655.

   Resolution: Fixed
Fix Version/s: 2.8.0

> Port MaxPerKeyExamplesTest off DoFnTester
> -
>
> Key: BEAM-3655
> URL: https://issues.apache.org/jira/browse/BEAM-3655
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Assignee: Aleksandr Kokhaniukov
>Priority: Major
>  Labels: beginner, newbie, starter
> Fix For: 2.8.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
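
Porting a test off {{DoFnTester}} generally means replacing direct {{DoFnTester.of(fn).processBundle(...)}} calls with a {{TestPipeline}} and {{PAssert}} assertions. A minimal sketch of the target pattern, using a placeholder transform rather than the actual MaxPerKeyExamples code:

{code:java}
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class ExampleDoFnTest {
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void testViaTestPipeline() {
    // Run the transform inside a real (test) pipeline instead of DoFnTester.
    PCollection<Integer> output =
        p.apply(Create.of(1, 2, 3))
         .apply(MapElements.via(new SimpleFunction<Integer, Integer>() {
           @Override public Integer apply(Integer x) { return x * 2; }
         }));
    // Assert on the whole output PCollection instead of a returned List.
    PAssert.that(output).containsInAnyOrder(2, 4, 6);
    p.run().waitUntilFinish();
  }
}
{code}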


[jira] [Assigned] (BEAM-5607) Typing library no longer provisional in Python 3.7

2018-10-02 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-5607:
--

Assignee: Alexey Romanenko  (was: Manu Zhang)

> Typing library no longer provisional in Python 3.7
> --
>
> Key: BEAM-5607
> URL: https://issues.apache.org/jira/browse/BEAM-5607
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Manu Zhang
>Assignee: Alexey Romanenko
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is an issue coming up in the future, since the target version in the 
> [initial 
> proposal|https://lists.apache.org/thread.html/5371469de567357b1431606f766217ef73a9098dc45046f51a6ecceb@%3Cdev.beam.apache.org%3E]
>  is only up to 3.6. Running Python tests on 3.7 throws the following error, 
> because the [typing library is no longer provisional in Python 
> 3.7|https://github.com/python/typing#important-dates]:
> {code}
> sdks/python/.eggs/typing-3.6.6-py3.7.egg/typing.py", line 1004, in __new__
> self._abc_registry = extra._abc_registry
> AttributeError: type object 'Callable' has no attribute '_abc_registry'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-5607) Typing library no longer provisional in Python 3.7

2018-10-02 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-5607:
--

Assignee: Manu Zhang  (was: Alexey Romanenko)

> Typing library no longer provisional in Python 3.7
> --
>
> Key: BEAM-5607
> URL: https://issues.apache.org/jira/browse/BEAM-5607
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is an issue coming up in the future, since the target version in the 
> [initial 
> proposal|https://lists.apache.org/thread.html/5371469de567357b1431606f766217ef73a9098dc45046f51a6ecceb@%3Cdev.beam.apache.org%3E]
>  is only up to 3.6. Running Python tests on 3.7 throws the following error, 
> because the [typing library is no longer provisional in Python 
> 3.7|https://github.com/python/typing#important-dates]:
> {code}
> sdks/python/.eggs/typing-3.6.6-py3.7.egg/typing.py", line 1004, in __new__
> self._abc_registry = extra._abc_registry
> AttributeError: type object 'Callable' has no attribute '_abc_registry'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (BEAM-4038) Support Kafka Headers in KafkaIO

2018-10-02 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635696#comment-16635696
 ] 

Alexey Romanenko commented on BEAM-4038:


[~rangadi] Thank you. Do you know if we have a Jira to add write support? If 
not, I think it would make sense to create one. WDYT?

> Support Kafka Headers in KafkaIO
> 
>
> Key: BEAM-4038
> URL: https://issues.apache.org/jira/browse/BEAM-4038
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-kafka
>Reporter: Geet Kumar
>Assignee: Geet Kumar
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Headers have been added to Kafka Consumer/Producer records (KAFKA-4208). The 
> purpose of this JIRA is to support this feature in KafkaIO.  
>  
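
For reference, the record-level headers introduced by KAFKA-4208 look like this on the plain Kafka client side (a sketch with made-up topic and header names; this is the Kafka client API, not the KafkaIO one):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class KafkaHeadersSketch {
  public static void main(String[] args) {
    // Headers are attached per record on the producer side and are readable
    // from ConsumerRecord#headers() on the consumer side.
    ProducerRecord<String, String> record =
        new ProducerRecord<>("my-topic", null, "key", "value");
    record.headers().add("trace-id", "abc123".getBytes(StandardCharsets.UTF_8));
    for (Header h : record.headers()) {
      System.out.println(h.key() + " = "
          + new String(h.value(), StandardCharsets.UTF_8));
    }
  }
}
{code}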



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-3371) Add ability to stage directories with compiled classes to Spark

2018-10-02 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-3371.

   Resolution: Fixed
Fix Version/s: 2.8.0

> Add ability to stage directories with compiled classes to Spark
> ---
>
> Key: BEAM-3371
> URL: https://issues.apache.org/jira/browse/BEAM-3371
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-spark
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Minor
> Fix For: 2.8.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> This is basically the same issue as [the Flink 
> one|https://issues.apache.org/jira/browse/BEAM-3370], except for two things:
> - detection of files to stage has to be provided in Spark, which is already 
> being developed [here|https://issues.apache.org/jira/browse/BEAM-981]
> - the test execution is not interrupted by a FileNotFoundException but by *the 
> effect* of the directory not being staged (absence of test classes on Spark's 
> classpath, hence a ClassNotFoundException).
> Again, this could probably be resolved analogously to Flink once the BEAM-981 
> issue is resolved.
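
For context, the usual fix for this class of problem is to pack the directory of compiled classes into a temporary jar before staging, since only jar files can be shipped to the cluster. A minimal sketch using plain JDK zip APIs (illustrative only, not the actual Beam implementation):

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StageDirectorySketch {
  // Packs a directory of compiled classes into a temp jar that can be staged.
  static File zipDirectory(File classesDir) throws IOException {
    File jar = File.createTempFile("staged-classes", ".jar");
    Path root = classesDir.toPath();
    try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(jar));
         Stream<Path> files = Files.walk(root)) {
      files.filter(Files::isRegularFile).forEach(p -> {
        try {
          // Entry names are classpath-relative, e.g. com/example/Foo.class.
          zos.putNextEntry(new ZipEntry(
              root.relativize(p).toString().replace('\\', '/')));
          Files.copy(p, zos);
          zos.closeEntry();
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
    return jar;
  }
}
{code}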



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3371) Add ability to stage directories with compiled classes to Spark

2018-10-02 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3371:
--

Assignee: Lukasz Gajowy  (was: Jean-Baptiste Onofré)

> Add ability to stage directories with compiled classes to Spark
> ---
>
> Key: BEAM-3371
> URL: https://issues.apache.org/jira/browse/BEAM-3371
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-spark
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Minor
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> This is basically the same issue as [the Flink 
> one|https://issues.apache.org/jira/browse/BEAM-3370], except for two things:
> - detection of files to stage has to be provided in Spark, which is already 
> being developed [here|https://issues.apache.org/jira/browse/BEAM-981]
> - the test execution is not interrupted by a FileNotFoundException but by *the 
> effect* of the directory not being staged (absence of test classes on Spark's 
> classpath, hence a ClassNotFoundException).
> Again, this could probably be resolved analogously to Flink once the BEAM-981 
> issue is resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3651) Port BigQueryTornadoesTest off DoFnTester

2018-10-01 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634270#comment-16634270
 ] 

Alexey Romanenko edited comment on BEAM-3651 at 10/1/18 4:27 PM:
-

Hi [~niuk-a], sure! Thank you for your contribution.


was (Author: aromanenko):
Hi [~niuk-a], sure! Thank you for your contribution.

[~iemejia] Could you add [~niuk-a] as a new contributor? Thanks

> Port BigQueryTornadoesTest off DoFnTester
> -
>
> Key: BEAM-3651
> URL: https://issues.apache.org/jira/browse/BEAM-3651
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3651) Port BigQueryTornadoesTest off DoFnTester

2018-10-01 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634273#comment-16634273
 ] 

Alexey Romanenko commented on BEAM-3651:


[~iemejia] Could you add [~niuk-a] as a new contributor? Thanks.

> Port BigQueryTornadoesTest off DoFnTester
> -
>
> Key: BEAM-3651
> URL: https://issues.apache.org/jira/browse/BEAM-3651
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3651) Port BigQueryTornadoesTest off DoFnTester

2018-10-01 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634270#comment-16634270
 ] 

Alexey Romanenko edited comment on BEAM-3651 at 10/1/18 4:26 PM:
-

Hi [~niuk-a], sure! Thank you for your contribution.

[~iemejia] Could you add [~niuk-a] as a new contributor? Thanks


was (Author: aromanenko):
Hi [~niuk-a], sure! Thank you for your contribution.

> Port BigQueryTornadoesTest off DoFnTester
> -
>
> Key: BEAM-3651
> URL: https://issues.apache.org/jira/browse/BEAM-3651
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3651) Port BigQueryTornadoesTest off DoFnTester

2018-10-01 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634270#comment-16634270
 ] 

Alexey Romanenko commented on BEAM-3651:


Hi [~niuk-a], sure! Thank you for your contribution.

> Port BigQueryTornadoesTest off DoFnTester
> -
>
> Key: BEAM-3651
> URL: https://issues.apache.org/jira/browse/BEAM-3651
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-5530) Migrate to java.time lib instead of joda-time

2018-10-01 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-5530:
--

 Summary: Migrate to java.time lib instead of joda-time
 Key: BEAM-5530
 URL: https://issues.apache.org/jira/browse/BEAM-5530
 Project: Beam
  Issue Type: Improvement
  Components: dependencies
Reporter: Alexey Romanenko
 Fix For: 3.0.0


Joda-Time has been used since before the move to Java 8; for now, these two 
time libraries are used together. It would make sense to finally move 
everywhere to only one library, *java.time*, as the standard Java time library 
(see the mailing list discussion: 
[https://lists.apache.org/thread.html/b10f6f9daed44f5fa65e315a44b68b2f57c3e80225f5d549b84918af@%3Cdev.beam.apache.org%3E]).

Since this migration will introduce breaking API changes, it should be 
addressed in the 3.0 release.
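
For illustration, the mapping for the most common types is close to one-to-one; a minimal sketch (the bridging line is only needed while both libraries coexist):

{code:java}
import java.time.Duration;
import java.time.Instant;

public class JodaToJavaTimeSketch {
  public static void main(String[] args) {
    // java.time equivalents of common Joda-Time idioms:
    Instant now = Instant.now();               // org.joda.time.Instant.now()
    Duration timeout = Duration.ofSeconds(30); // Duration.standardSeconds(30)
    Instant deadline = now.plus(timeout);      // arithmetic API is very similar

    // Bridging while both libraries coexist (millisecond precision):
    org.joda.time.Instant joda = new org.joda.time.Instant(now.toEpochMilli());
    System.out.println(deadline + " / " + joda);
  }
}
{code}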



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4038) Support Kafka Headers in KafkaIO

2018-10-01 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633920#comment-16633920
 ] 

Alexey Romanenko commented on BEAM-4038:


[~gkumar7] Hi, are you going to continue working on this? I see that PR 5287 
was closed due to inactivity for more than 60 days.

> Support Kafka Headers in KafkaIO
> 
>
> Key: BEAM-4038
> URL: https://issues.apache.org/jira/browse/BEAM-4038
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-kafka
>Reporter: Geet Kumar
>Assignee: Geet Kumar
>Priority: Major
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Headers have been added to Kafka Consumer/Producer records (KAFKA-4208). The 
> purpose of this JIRA is to support this feature in KafkaIO.  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-5144) [beam_PostCommit_Java_GradleBuild][org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMark][Flake] Expected messages count assert fails

2018-09-19 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-5144:
--

Assignee: Andrew Fulton  (was: Jean-Baptiste Onofré)

> [beam_PostCommit_Java_GradleBuild][org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMark][Flake]
>  Expected messages count assert fails
> ---
>
> Key: BEAM-5144
> URL: https://issues.apache.org/jira/browse/BEAM-5144
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Mikhail Gryzykhin
>Assignee: Andrew Fulton
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Failing job url:
> [https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1196/testReport/]
>  Job history url:
> [https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1245/testReport/junit/org.apache.beam.sdk.io.jms/JmsIOTest/testCheckpointMark/history/?start=50]
>  Relevant log:
> java.lang.AssertionError: expected:<0> but was:<5> at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.failNotEquals(Assert.java:834) at 
> org.junit.Assert.assertEquals(Assert.java:645) at 
> org.junit.Assert.assertEquals(Assert.java:631) at 
> org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMark(JmsIOTest.java:324) at
>  
> Multiple security exceptions found:
> java.lang.SecurityException: User name [test_user] or password is invalid.
> Aug 07, 2018 6:10:07 PM org.apache.activemq.broker.TransportConnection 
> processAddConnection WARNING: Failed to add Connection 
> ID:apache-beam-jenkins-slave-group-rq04-41395-1533665392741-59:9 due to {} 
> java.lang.SecurityException: User name [null] or password is invalid.
>  
>  
> Additionally, broker service stopped errors:
> org.apache.activemq.broker.BrokerStoppedException: Broker 
> BrokerService[localhost] is being stopped
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3912) Add batching support for HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3912:
---
Description: (was: For the moment, there is only HadoopInputFormatIO in 
Beam. To provide a support of different writing IOs, that are not yet natively 
supported in Beam (for example, Apache Orc or HBase bulk load), it would make 
sense to add HadoopOutputFormatIO as well.)

> Add batching support for HadoopOutputFormatIO
> -
>
> Key: BEAM-3912
> URL: https://issues.apache.org/jira/browse/BEAM-3912
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5310) Add support of HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-5310:
---
Description: 
For the moment, there is only HadoopInputFormatIO in Beam. To provide support 
for different write IOs that are not yet natively supported in Beam (for 
example, Apache ORC or HBase bulk load), it would make sense to add 
HadoopOutputFormatIO as well. It will incorporate support for both batch and 
streaming processing. 



> Add support of HadoopOutputFormatIO
> ---
>
> Key: BEAM-5310
> URL: https://issues.apache.org/jira/browse/BEAM-5310
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>
> For the moment, there is only HadoopInputFormatIO in Beam. To provide 
> support for different write IOs that are not yet natively supported in Beam 
> (for example, Apache ORC or HBase bulk load), it would make sense to add 
> HadoopOutputFormatIO as well. It will incorporate support for both batch and 
> streaming processing. 
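
For a sense of what such an IO would wrap: a Hadoop {{OutputFormat}} is driven entirely by a {{Configuration}}, roughly as in the sketch below (a hypothetical setup using the standard mapreduce property keys, not a Beam API):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopOutputFormatConfigSketch {
  public static Configuration buildConf() {
    Configuration conf = new Configuration();
    // Which OutputFormat to instantiate, and the key/value types it receives.
    conf.setClass("mapreduce.job.outputformat.class",
        TextOutputFormat.class, OutputFormat.class);
    conf.setClass("mapreduce.job.output.key.class", Text.class, Object.class);
    conf.setClass("mapreduce.job.output.value.class", Text.class, Object.class);
    conf.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/out");
    return conf;
  }
}
{code}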



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5309) Add streaming support for HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-5309:
---
Summary: Add streaming support for HadoopOutputFormatIO  (was: Add 
streaming support for HadoopOutputFormat)

> Add streaming support for HadoopOutputFormatIO
> --
>
> Key: BEAM-5309
> URL: https://issues.apache.org/jira/browse/BEAM-5309
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: David Moravek
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5309) Add streaming support for HadoopOutputFormat

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-5309:
---
Issue Type: Sub-task  (was: Improvement)
Parent: BEAM-5310

> Add streaming support for HadoopOutputFormat
> 
>
> Key: BEAM-5309
> URL: https://issues.apache.org/jira/browse/BEAM-5309
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: David Moravek
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3912) Add batching support of HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3912:
---
Issue Type: Sub-task  (was: Improvement)
Parent: BEAM-5310

> Add batching support of HadoopOutputFormatIO
> 
>
> Key: BEAM-3912
> URL: https://issues.apache.org/jira/browse/BEAM-3912
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> For the moment, there is only HadoopInputFormatIO in Beam. To provide 
> support for different write IOs that are not yet natively supported in Beam 
> (for example, Apache ORC or HBase bulk load), it would make sense to add 
> HadoopOutputFormatIO as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3912) Add batching support for HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3912:
---
Summary: Add batching support for HadoopOutputFormatIO  (was: Add batching 
support of HadoopOutputFormatIO)

> Add batching support for HadoopOutputFormatIO
> -
>
> Key: BEAM-3912
> URL: https://issues.apache.org/jira/browse/BEAM-3912
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> For the moment, there is only HadoopInputFormatIO in Beam. To provide 
> support for different write IOs that are not yet natively supported in Beam 
> (for example, Apache ORC or HBase bulk load), it would make sense to add 
> HadoopOutputFormatIO as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-5310) Add support of HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-5310:
--

 Summary: Add support of HadoopOutputFormatIO
 Key: BEAM-5310
 URL: https://issues.apache.org/jira/browse/BEAM-5310
 Project: Beam
  Issue Type: Improvement
  Components: io-java-hadoop
Reporter: Alexey Romanenko
Assignee: Alexey Romanenko






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3912) Add batching support of HadoopOutputFormatIO

2018-09-05 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3912:
---
Summary: Add batching support of HadoopOutputFormatIO  (was: Add support of 
HadoopOutputFormatIO)

> Add batching support of HadoopOutputFormatIO
> 
>
> Key: BEAM-3912
> URL: https://issues.apache.org/jira/browse/BEAM-3912
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-hadoop
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> For the moment, there is only HadoopInputFormatIO in Beam. To provide 
> support for different write IOs that are not yet natively supported in Beam 
> (for example, Apache ORC or HBase bulk load), it would make sense to add 
> HadoopOutputFormatIO as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-5309) Add streaming support for HadoopOutputFormat

2018-09-05 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-5309:
--

 Summary: Add streaming support for HadoopOutputFormat
 Key: BEAM-5309
 URL: https://issues.apache.org/jira/browse/BEAM-5309
 Project: Beam
  Issue Type: Improvement
  Components: io-java-hadoop
Reporter: Alexey Romanenko
Assignee: David Moravek






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5060) Issues with aws KPL while writing to kinesis using beam

2018-08-23 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589924#comment-16589924
 ] 

Alexey Romanenko commented on BEAM-5060:


I changed the priority to "Critical" since it's a release blocker.

> Issues with aws KPL while writing to kinesis using beam
> ---
>
> Key: BEAM-5060
> URL: https://issues.apache.org/jira/browse/BEAM-5060
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-aws
>Affects Versions: 2.5.0
>Reporter: Varsha Thanooj
>Assignee: Alexey Romanenko
>Priority: Critical
> Attachments: pom.xml
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I am trying to write data to Kinesis using the Apache Beam Kinesis IO, but I 
> am having some issues.
> PS: I am using AWS STS.
>  
> The console output shows:
>  
> {code:java}
> Exception in thread "main" 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> com.amazonaws.services.kinesis.producer.IKinesisProducer.addUserRecord(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/nio/ByteBuffer;)Lorg/apache/beam/repackaged/beam_sdks_java_io_kinesis/com/google/common/util/concurrent/ListenableFuture;
> at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:349)
> at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:319)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:210)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:66)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
> at com.nestaway.beam_demo.KinesisSql.main(KinesisSql.java:153)
> Caused by: java.lang.NoSuchMethodError: 
> com.amazonaws.services.kinesis.producer.IKinesisProducer.addUserRecord(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/nio/ByteBuffer;)Lorg/apache/beam/repackaged/beam_sdks_java_io_kinesis/com/google/common/util/concurrent/ListenableFuture;
> at 
> org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.processElement(KinesisIO.java:568)
> {code}
>  
>  
> Code (data is a PCollection in byte[] format):
>  
> {code:java}
> data.apply(KinesisIO.write()
> .withStreamName("stagWatchBallEventStream")
> .withPartitionKey("a")
> .withAWSClientsProvider(new CustomKinesisClientProvider()));
> {code}
>  
>  Custom Kinesis client provider:
> {code:java}
> public class CustomKinesisClientProvider implements AWSClientsProvider {
> private static final long serialVersionUID = 1L;
> private static String ID = "X";
> private static String SECRET = "X";
> private static String TOKEN = "X";
> private static BasicSessionCredentials sessionCredentials = new 
> BasicSessionCredentials(
>   ID,
>   SECRET,
>   TOKEN);
> private static KinesisProducerConfiguration config = new 
> KinesisProducerConfiguration()
>    .setRecordMaxBufferedTime(3000)
>    .setMaxConnections(1)
>    .setRequestTimeout(6)
>    .setRegion("us-west-2")
>    .setCredentialsProvider(new 
> AWSStaticCredentialsProvider(sessionCredentials));
> @Override
> public IKinesisProducer createKinesisProducer(KinesisProducerConfiguration 
> config) {
> return new KinesisProducer(config);
> }
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-5060) Issues with aws KPL while writing to kinesis using beam

2018-08-23 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-5060:
---
Priority: Critical  (was: Major)

> Issues with aws KPL while writing to kinesis using beam
> ---
>
> Key: BEAM-5060
> URL: https://issues.apache.org/jira/browse/BEAM-5060
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-aws
>Affects Versions: 2.5.0
>Reporter: Varsha Thanooj
>Assignee: Alexey Romanenko
>Priority: Critical
> Attachments: pom.xml
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I am trying to write data to Kinesis using the Apache Beam Kinesis IO, but I 
> am having some issues.
> PS: I am using AWS STS.
>  
> The console output shows:
>  
> {code:java}
> Exception in thread "main" 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> com.amazonaws.services.kinesis.producer.IKinesisProducer.addUserRecord(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/nio/ByteBuffer;)Lorg/apache/beam/repackaged/beam_sdks_java_io_kinesis/com/google/common/util/concurrent/ListenableFuture;
> at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:349)
> at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:319)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:210)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:66)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
> at com.nestaway.beam_demo.KinesisSql.main(KinesisSql.java:153)
> Caused by: java.lang.NoSuchMethodError: 
> com.amazonaws.services.kinesis.producer.IKinesisProducer.addUserRecord(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/nio/ByteBuffer;)Lorg/apache/beam/repackaged/beam_sdks_java_io_kinesis/com/google/common/util/concurrent/ListenableFuture;
> at 
> org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.processElement(KinesisIO.java:568)
> {code}
>  
>  
> Code (data is a PCollection in byte[] format):
>  
> {code:java}
> data.apply(KinesisIO.write()
> .withStreamName("stagWatchBallEventStream")
> .withPartitionKey("a")
> .withAWSClientsProvider(new CustomKinesisClientProvider()));
> {code}
>  
>  Custom Kinesis client provider:
> {code:java}
> public class CustomKinesisClientProvider implements AWSClientsProvider {
> private static final long serialVersionUID = 1L;
> private static String ID = "X";
> private static String SECRET = "X";
> private static String TOKEN = "X";
> private static BasicSessionCredentials sessionCredentials = new 
> BasicSessionCredentials(
>   ID,
>   SECRET,
>   TOKEN);
> private static KinesisProducerConfiguration config = new 
> KinesisProducerConfiguration()
>    .setRecordMaxBufferedTime(3000)
>    .setMaxConnections(1)
>    .setRequestTimeout(6)
>    .setRegion("us-west-2")
>    .setCredentialsProvider(new 
> AWSStaticCredentialsProvider(sessionCredentials));
> @Override
> public IKinesisProducer createKinesisProducer(KinesisProducerConfiguration 
> config) {
> return new KinesisProducer(config);
> }
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-23 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3654:
---
Fix Version/s: 2.7.0  (was: Not applicable)

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Assignee: Mikhail
>Priority: Major
>  Labels: beginner, newbie, starter
> Fix For: 2.7.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-22 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3654:
--

Assignee: (was: Reuven Lax)

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-22 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3654:
--

Assignee: Reuven Lax

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Assignee: Reuven Lax
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-22 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3654:
--

Assignee: (was: Reuven Lax)

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3654) Port FilterExamplesTest off DoFnTester

2018-08-22 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3654:
--

Assignee: Reuven Lax

> Port FilterExamplesTest off DoFnTester
> --
>
> Key: BEAM-3654
> URL: https://issues.apache.org/jira/browse/BEAM-3654
> Project: Beam
>  Issue Type: Sub-task
>  Components: examples-java
>Reporter: Kenneth Knowles
>Assignee: Reuven Lax
>Priority: Major
>  Labels: beginner, newbie, starter
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5060) Issues with aws KPL while writing to kinesis using beam

2018-08-20 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586158#comment-16586158
 ] 

Alexey Romanenko commented on BEAM-5060:


[~bvt279] Thank you for reporting this. Could you attach the pom file that you 
use to build your Beam application?

> Issues with aws KPL while writing to kinesis using beam
> ---
>
> Key: BEAM-5060
> URL: https://issues.apache.org/jira/browse/BEAM-5060
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-aws
>Affects Versions: 2.5.0
>Reporter: Varsha Thanooj
>Assignee: Alexey Romanenko
>Priority: Major
>
> I am trying to write data to Kinesis using the Apache Beam Kinesis IO, but I 
> am having some issues.
> PS: I am using AWS STS.
>  
> The console output shows:
>  
> {code:java}
> Exception in thread "main" 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> com.amazonaws.services.kinesis.producer.IKinesisProducer.addUserRecord(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/nio/ByteBuffer;)Lorg/apache/beam/repackaged/beam_sdks_java_io_kinesis/com/google/common/util/concurrent/ListenableFuture;
> at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:349)
> at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:319)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:210)
> at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:66)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
> at com.nestaway.beam_demo.KinesisSql.main(KinesisSql.java:153)
> Caused by: java.lang.NoSuchMethodError: 
> com.amazonaws.services.kinesis.producer.IKinesisProducer.addUserRecord(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/nio/ByteBuffer;)Lorg/apache/beam/repackaged/beam_sdks_java_io_kinesis/com/google/common/util/concurrent/ListenableFuture;
> at 
> org.apache.beam.sdk.io.kinesis.KinesisIO$Write$KinesisWriterFn.processElement(KinesisIO.java:568)
> {code}
>  
>  
> Code (data is a PCollection in byte[] format):
>  
> {code:java}
> data.apply(KinesisIO.write()
> .withStreamName("stagWatchBallEventStream")
> .withPartitionKey("a")
> .withAWSClientsProvider(new CustomKinesisClientProvider()));
> {code}
>  
>  Custom Kinesis client provider:
> {code:java}
> public class CustomKinesisClientProvider implements AWSClientsProvider {
> private static final long serialVersionUID = 1L;
> private static String ID = "X";
> private static String SECRET = "X";
> private static String TOKEN = "X";
> private static BasicSessionCredentials sessionCredentials = new 
> BasicSessionCredentials(
>   ID,
>   SECRET,
>   TOKEN);
> private static KinesisProducerConfiguration config = new 
> KinesisProducerConfiguration()
>    .setRecordMaxBufferedTime(3000)
>    .setMaxConnections(1)
>    .setRequestTimeout(6)
>    .setRegion("us-west-2")
>    .setCredentialsProvider(new 
> AWSStaticCredentialsProvider(sessionCredentials));
> @Override
> public IKinesisProducer createKinesisProducer(KinesisProducerConfiguration 
> config) {
> return new KinesisProducer(config);
> }
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-5147) Expose document metadata in ElasticsearchIO read

2018-08-15 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-5147.

   Resolution: Fixed
Fix Version/s: 2.7.0

> Expose document metadata in ElasticsearchIO read
> 
>
> Key: BEAM-5147
> URL: https://issues.apache.org/jira/browse/BEAM-5147
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-elasticsearch
>Affects Versions: 2.7.0
>Reporter: Tim Robertson
>Assignee: Etienne Chauchot
>Priority: Minor
> Fix For: 2.7.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The beam ElasticsearchIO read does not give access to document metadata. A 
> Beam user has requested that it be possible to expose the entire elastic 
> document including metadata from the search result. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4803) Beam spark runner not working properly with kafka

2018-08-10 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576473#comment-16576473
 ] 

Alexey Romanenko commented on BEAM-4803:


[~rangadi] Interesting! Do you know why this value (1 minute) was chosen for 
the cache period? Is it configurable? I'm thinking of making it configurable 
for the Spark runner as well.

> Beam spark runner not working properly with kafka
> -
>
> Key: BEAM-4803
> URL: https://issues.apache.org/jira/browse/BEAM-4803
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Vivek Agarwal
>Assignee: Jean-Baptiste Onofré
>Priority: Major
>
> We are running a beam stream processing job on a spark runner, which reads 
> from a kafka topic using kerberos authentication. We are using java-io-kafka 
> v2.4.0 to read from kafka topic in the pipeline. The issue is that the 
> kafkaIO client is continuously creating a new kafka consumer with specified 
> config, doing kerberos login every time. Also, there are spark streaming jobs 
> which get spawned for the unbounded source, every second or so even when 
> there is no data in the kafka topic. Log has these jobs-
> INFO SparkContext: Starting job: dstr...@sparkunboumdedsource.java:172
> We can see in the logs
> INFO MicrobatchSource: No cached reader found for split: 
> [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
> reader at checkpoint mark...
> And then it creates new consumer doing fresh kerberos login, which is 
> creating issues.
> We are unsure of what should be correct behavior here and why so many spark 
> streaming jobs are getting created. We tried the beam code with flink runner 
> and did not find this issue there. Can someone point to the correct settings 
> for using unbounded kafka source with spark runner using beam? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-4632) KafkaIO seems to fail on streaming mode over spark runner

2018-08-10 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-4632.

   Resolution: Cannot Reproduce
Fix Version/s: 2.6.0

> KafkaIO seems to fail on streaming mode over spark runner
> -
>
> Key: BEAM-4632
> URL: https://issues.apache.org/jira/browse/BEAM-4632
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
> Environment: Ubuntu 16.04.4 LTS
>Reporter: Rick Lin
>Assignee: Alexey Romanenko
>Priority: Major
> Fix For: 2.6.0
>
> Attachments: .withMaxNumRecords(50).JPG, 
> DB_table_kafkabeamdata_count.JPG, error UnboundedDataset.java 81(0) has 
> different number of partitions.JPG, the error GeneratedMessageV3.JPG, the 
> error GeneratedMessageV3.JPG
>
>
> Dear sir,
> The following versions of related tools are set in my running program:
> ==
> Beam 2.4.0 (Direct runner and Spark runner)
> Spark 2.2.1 (local mode and standalone mode)
> Kafka: 2.11-0.10.1.1
> scala: 2.11.8
> java: 1.8
> ==
> My programs (KafkaToKafka.java and StarterPipeline.java) are as shown on my 
> github: [https://github.com/LinRick/beamkafkaIO]
> The description of my situation is as follows:
> The kafka broker is working, and kafkaIO.read (consumer) is used to capture 
> data from the assigned broker ip (http://ubuntu7:9092).
> The user manual of the kafkaIO SDK 
> ([https://beam.apache.org/documentation/sdks/javadoc/2.4.0/]) indicates that 
> the following parameters need to be set, and then the kafkaIO can work well:
> .withBootstrapServers("kafka broker ip:9092")
> .withTopic("kafkasink")
> .withKeyDeserializer(IntegerDeserializer.class)
> .withValueDeserializer(StringDeserializer.class)
> When I run my program with these settings over the direct runner, it performs 
> well and runs in streaming mode. *However, when I run the same code with the 
> same settings (kafkaIO) over the spark runner, the program does not run in 
> streaming mode and is shut down.* As mentioned on the website 
> [https://beam.apache.org/documentation/runners/spark/], the runner should set 
> streaming mode automatically. Unfortunately, it failed for my program.
> On the other hand, if I set the parameter kafkaIO.read.withMaxNumRecords(1000) 
> or kafkaIO.read.withMaxReadTime (Duration second), my program successfully 
> executes in batch mode (batch processing).
> The steps for running StarterPipeline.java in my program are:
> step1 mvn compile exec:java -Dexec.mainClass=com.itri.beam.StarterPipeline 
> -Pspark2-runner -Dexec.args="--runner=SparkRunner"
> step2 mvn clean package
> step3 cp -rf target/beamkafkaIO-0.1.jar /root/
> step4 cd /spark-2.2.1-bin-hadoop2.6/bin
> step5 ./spark-submit --class com.itri.beam.StarterPipeline --master local[4] 
> /root/beamkafkaIO-0.1.jar --runner=SparkRunner
> I am not sure whether this issue is a bug in kafkaIO or whether I got some 
> parameter settings wrong for the spark runner.
> I really can't handle it, so I hope to get help from you.
> If any further information is needed, I am glad to be informed and will 
> provide it to you as soon as possible.
> I will highly appreciate it if you can help me to deal with this issue.
> I am looking forward to hearing from you.
>  
> Sincerely yours,
>  
> Rick
>  
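
For reference, the settings quoted above correspond to a read roughly like the following sketch (Beam 2.4-era KafkaIO API; the broker address and topic come from the report, the rest is a minimal skeleton):

{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<KV<Integer, String>> records =
        p.apply(KafkaIO.<Integer, String>read()
            .withBootstrapServers("ubuntu7:9092")
            .withTopic("kafkasink")
            .withKeyDeserializer(IntegerDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // Either bound below turns the unbounded read into a batch job,
            // which is what the reporter observed:
            // .withMaxNumRecords(1000)
            .withoutMetadata());
    p.run().waitUntilFinish();
  }
}
{code}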



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4632) KafkaIO seems to fail on streaming mode over spark runner

2018-08-10 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576039#comment-16576039
 ] 

Alexey Romanenko commented on BEAM-4632:


[~Ricklin] Unfortunately, I can't reproduce your issue on my side; both 
pipelines work as expected with the latest stable Beam (2.6.0). Perhaps this 
error was caused by your environment settings. 

> KafkaIO seems to fail on streaming mode over spark runner
> -
>
> Key: BEAM-4632
> URL: https://issues.apache.org/jira/browse/BEAM-4632
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
> Environment: Ubuntu 16.04.4 LTS
>Reporter: Rick Lin
>Assignee: Alexey Romanenko
>Priority: Major
> Attachments: .withMaxNumRecords(50).JPG, 
> DB_table_kafkabeamdata_count.JPG, error UnboundedDataset.java 81(0) has 
> different number of partitions.JPG, the error GeneratedMessageV3.JPG, the 
> error GeneratedMessageV3.JPG
>
>
> Dear sir,
> The following versions of related tools are set in my running program:
> ==
> Beam 2.4.0 (Direct runner and Spark runner)
> Spark 2.2.1 (local mode and standalone mode)
> Kafka: 2.11-0.10.1.1
> scala: 2.11.8
> java: 1.8
> ==
> My programs (KafkaToKafka.java and StarterPipeline.java) are as shown on my 
> github: [https://github.com/LinRick/beamkafkaIO]
> The description of my situation is as follows:
> The kafka broker is working, and kafkaIO.read (consumer) is used to capture 
> data from the assigned broker ip (http://ubuntu7:9092).
> The user manual of the kafkaIO SDK 
> ([https://beam.apache.org/documentation/sdks/javadoc/2.4.0/]) indicates that 
> the following parameters need to be set, and then the kafkaIO can work well:
> .withBootstrapServers("kafka broker ip:9092")
> .withTopic("kafkasink")
> .withKeyDeserializer(IntegerDeserializer.class)
> .withValueDeserializer(StringDeserializer.class)
> When I run my program with these settings over the direct runner, it performs 
> well and runs in streaming mode. *However, when I run the same code with the 
> same settings (kafkaIO) over the spark runner, the program does not run in 
> streaming mode and is shut down.* As mentioned on the website 
> [https://beam.apache.org/documentation/runners/spark/], the runner should set 
> streaming mode automatically. Unfortunately, it failed for my program.
> On the other hand, if I set the parameter kafkaIO.read.withMaxNumRecords(1000) 
> or kafkaIO.read.withMaxReadTime (Duration second), my program successfully 
> executes in batch mode (batch processing).
> The steps for running StarterPipeline.java in my program are:
> step1 mvn compile exec:java -Dexec.mainClass=com.itri.beam.StarterPipeline 
> -Pspark2-runner -Dexec.args="--runner=SparkRunner"
> step2 mvn clean package
> step3 cp -rf target/beamkafkaIO-0.1.jar /root/
> step4 cd /spark-2.2.1-bin-hadoop2.6/bin
> step5 ./spark-submit --class com.itri.beam.StarterPipeline --master local[4] 
> /root/beamkafkaIO-0.1.jar --runner=SparkRunner
> I am not sure whether this issue is a bug in kafkaIO or whether I got some 
> parameter settings wrong for the spark runner.
> I really can't handle it, so I hope to get help from you.
> If any further information is needed, I am glad to be informed and will 
> provide it to you as soon as possible.
> I will highly appreciate it if you can help me to deal with this issue.
> I am looking forward to hearing from you.
>  
> Sincerely yours,
>  
> Rick
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4803) Beam spark runner not working properly with kafka

2018-08-08 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573139#comment-16573139
 ] 

Alexey Romanenko commented on BEAM-4803:


[~vivek_17] Yes, as you correctly noticed, the runner should cache readers. It 
uses {{com.google.common.cache.Cache}} internally for that purpose, and this 
cache is created with {{expireAfterAccess(long duration, TimeUnit unit)}}, 
where {{duration = readerCacheInterval}}. 
It seems that this is not enough for microbatches, and cache entries expire 
very quickly. Perhaps we need to rethink this cache policy for streaming mode. 
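
A minimal sketch of the cache policy described above, assuming Guava's {{CacheBuilder}} (the key type and the interval value here are illustrative, not taken from the runner's code):

{code:java}
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ReaderCacheSketch {
  public static void main(String[] args) {
    long readerCacheInterval = 60_000L; // e.g. 1 minute, in milliseconds
    // Entries not accessed within readerCacheInterval are evicted, so the
    // next microbatch misses the cache and creates a new reader (and, with
    // Kerberos, performs a fresh login).
    Cache<String, Object> readerCache = CacheBuilder.newBuilder()
        .expireAfterAccess(readerCacheInterval, TimeUnit.MILLISECONDS)
        .build();
    readerCache.put("split-0", new Object());
    System.out.println(readerCache.getIfPresent("split-0") != null);
  }
}
{code}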

> Beam spark runner not working properly with kafka
> -
>
> Key: BEAM-4803
> URL: https://issues.apache.org/jira/browse/BEAM-4803
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Vivek Agarwal
>Assignee: Jean-Baptiste Onofré
>Priority: Major
>
> We are running a beam stream processing job on a spark runner, which reads 
> from a kafka topic using kerberos authentication. We are using java-io-kafka 
> v2.4.0 to read from kafka topic in the pipeline. The issue is that the 
> kafkaIO client is continuously creating a new kafka consumer with specified 
> config, doing kerberos login every time. Also, there are spark streaming jobs 
> which get spawned for the unbounded source, every second or so even when 
> there is no data in the kafka topic. Log has these jobs-
> INFO SparkContext: Starting job: dstr...@sparkunboumdedsource.java:172
> We can see in the logs
> INFO MicrobatchSource: No cached reader found for split: 
> [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
> reader at checkpoint mark...
> And then it creates new consumer doing fresh kerberos login, which is 
> creating issues.
> We are unsure of what should be correct behavior here and why so many spark 
> streaming jobs are getting created. We tried the beam code with flink runner 
> and did not find this issue there. Can someone point to the correct settings 
> for using unbounded kafka source with spark runner using beam? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4632) KafkaIO seems to fail on streaming mode over spark runner

2018-07-20 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550912#comment-16550912
 ] 

Alexey Romanenko commented on BEAM-4632:


[~Ricklin] Thank you for all the information provided. Regarding your 
questions about windows/triggers, I'd suggest addressing them on the 
u...@beam.apache.org mailing list.

Regarding your last question about the broken pipeline: could you provide the 
whole code of the pipeline that reproduces this issue? I'll try to reproduce 
it on my side.



> KafkaIO seems to fail on streaming mode over spark runner
> -
>
> Key: BEAM-4632
> URL: https://issues.apache.org/jira/browse/BEAM-4632
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
> Environment: Ubuntu 16.04.4 LTS
>Reporter: Rick Lin
>Assignee: Alexey Romanenko
>Priority: Major
> Attachments: .withMaxNumRecords(50).JPG, 
> DB_table_kafkabeamdata_count.JPG, error UnboundedDataset.java 81(0) has 
> different number of partitions.JPG, the error GeneratedMessageV3.JPG, the 
> error GeneratedMessageV3.JPG
>
>
> Dear sir,
> The following versions of related tools are set in my running program:
> ==
> Beam 2.4.0 (Direct runner and Spark runner)
> Spark 2.2.1 (local mode and standalone mode)
> Kafka: 2.11-0.10.1.1
> scala: 2.11.8
> java: 1.8
> ==
> My programs (KafkaToKafka.java and StarterPipeline.java) are as shown on my 
> github: [https://github.com/LinRick/beamkafkaIO]
> The description of my situation is as follows:
> The kafka broker is working, and kafkaIO.read (consumer) is used to capture 
> data from the assigned broker ip (http://ubuntu7:9092).
> The user manual of the kafkaIO SDK 
> ([https://beam.apache.org/documentation/sdks/javadoc/2.4.0/]) indicates that 
> the following parameters need to be set, and then the kafkaIO can work well:
> .withBootstrapServers("kafka broker ip:9092")
> .withTopic("kafkasink")
> .withKeyDeserializer(IntegerDeserializer.class)
> .withValueDeserializer(StringDeserializer.class)
> When I run my program with these settings over the direct runner, it performs 
> well and runs in streaming mode. *However, when I run the same code with the 
> same settings (kafkaIO) over the spark runner, the program does not run in 
> streaming mode and is shut down.* As mentioned on the website 
> [https://beam.apache.org/documentation/runners/spark/], the runner should set 
> streaming mode automatically. Unfortunately, it failed for my program.
> On the other hand, if I set the parameter kafkaIO.read.withMaxNumRecords(1000) 
> or kafkaIO.read.withMaxReadTime (Duration second), my program successfully 
> executes in batch mode (batch processing).
> The steps for running StarterPipeline.java in my program are:
> step1 mvn compile exec:java -Dexec.mainClass=com.itri.beam.StarterPipeline 
> -Pspark2-runner -Dexec.args="--runner=SparkRunner"
> step2 mvn clean package
> step3 cp -rf target/beamkafkaIO-0.1.jar /root/
> step4 cd /spark-2.2.1-bin-hadoop2.6/bin
> step5 ./spark-submit --class com.itri.beam.StarterPipeline --master local[4] 
> /root/beamkafkaIO-0.1.jar --runner=SparkRunner
> I am not sure whether this issue is a bug in kafkaIO or whether I got some 
> parameter settings wrong for the spark runner.
> I really can't handle it, so I hope to get help from you.
> If any further information is needed, I am glad to be informed and will 
> provide it to you as soon as possible.
> I will highly appreciate it if you can help me to deal with this issue.
> I am looking forward to hearing from you.
>  
> Sincerely yours,
>  
> Rick
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4803) Beam spark runner not working properly with kafka

2018-07-20 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4803:
---
Affects Version/s: 2.5.0

> Beam spark runner not working properly with kafka
> -
>
> Key: BEAM-4803
> URL: https://issues.apache.org/jira/browse/BEAM-4803
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Vivek Agarwal
>Assignee: Raghu Angadi
>Priority: Major
>
> We are running a Beam stream processing job on the Spark runner, which reads 
> from a Kafka topic using Kerberos authentication. We are using java-io-kafka 
> v2.4.0 to read from the Kafka topic in the pipeline. The issue is that the 
> KafkaIO client is continuously creating a new Kafka consumer with the 
> specified config, doing a Kerberos login every time. Also, Spark streaming 
> jobs get spawned for the unbounded source every second or so, even when 
> there is no data in the Kafka topic. The log has these jobs:
> INFO SparkContext: Starting job: dstr...@sparkunboumdedsource.java:172
> We can see in the logs:
> INFO MicrobatchSource: No cached reader found for split: 
> [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
> reader at checkpoint mark...
> It then creates a new consumer doing a fresh Kerberos login, which is 
> causing issues.
> We are unsure of what the correct behavior should be here and why so many 
> Spark streaming jobs are getting created. We tried the Beam code with the 
> Flink runner and did not find this issue there. Can someone point to the 
> correct settings for using an unbounded Kafka source with the Spark runner?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4803) Beam spark runner not working properly with kafka

2018-07-20 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550897#comment-16550897
 ] 

Alexey Romanenko commented on BEAM-4803:


I can confirm that it starts a new job every time; I'm not sure whether that's 
expected behaviour or not, it needs to be investigated.
[~rangadi] I am unassigning myself since I'll be off for the next 2 weeks. I 
can get back to this afterwards if no one else takes it.

> Beam spark runner not working properly with kafka
> -
>
> Key: BEAM-4803
> URL: https://issues.apache.org/jira/browse/BEAM-4803
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Vivek Agarwal
>Assignee: Raghu Angadi
>Priority: Major
>
> We are running a Beam stream processing job on the Spark runner, which reads 
> from a Kafka topic using Kerberos authentication. We are using java-io-kafka 
> v2.4.0 to read from the Kafka topic in the pipeline. The issue is that the 
> KafkaIO client is continuously creating a new Kafka consumer with the 
> specified config, doing a Kerberos login every time. Also, Spark streaming 
> jobs get spawned for the unbounded source every second or so, even when 
> there is no data in the Kafka topic. The log has these jobs:
> INFO SparkContext: Starting job: dstr...@sparkunboumdedsource.java:172
> We can see in the logs:
> INFO MicrobatchSource: No cached reader found for split: 
> [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
> reader at checkpoint mark...
> It then creates a new consumer doing a fresh Kerberos login, which is 
> causing issues.
> We are unsure of what the correct behavior should be here and why so many 
> Spark streaming jobs are getting created. We tried the Beam code with the 
> Flink runner and did not find this issue there. Can someone point to the 
> correct settings for using an unbounded Kafka source with the Spark runner?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4803) Beam spark runner not working properly with kafka

2018-07-20 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-4803:
--

Assignee: Raghu Angadi  (was: Alexey Romanenko)

> Beam spark runner not working properly with kafka
> -
>
> Key: BEAM-4803
> URL: https://issues.apache.org/jira/browse/BEAM-4803
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
>Reporter: Vivek Agarwal
>Assignee: Raghu Angadi
>Priority: Major
>
> We are running a Beam stream processing job on the Spark runner, which reads 
> from a Kafka topic using Kerberos authentication. We are using java-io-kafka 
> v2.4.0 to read from the Kafka topic in the pipeline. The issue is that the 
> KafkaIO client is continuously creating a new Kafka consumer with the 
> specified config, doing a Kerberos login every time. Also, Spark streaming 
> jobs get spawned for the unbounded source every second or so, even when 
> there is no data in the Kafka topic. The log has these jobs:
> INFO SparkContext: Starting job: dstr...@sparkunboumdedsource.java:172
> We can see in the logs:
> INFO MicrobatchSource: No cached reader found for split: 
> [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
> reader at checkpoint mark...
> It then creates a new consumer doing a fresh Kerberos login, which is 
> causing issues.
> We are unsure of what the correct behavior should be here and why so many 
> Spark streaming jobs are getting created. We tried the Beam code with the 
> Flink runner and did not find this issue there. Can someone point to the 
> correct settings for using an unbounded Kafka source with the Spark runner?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4798) IndexOutOfBoundsException when Flink parallelism > 1

2018-07-19 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549044#comment-16549044
 ] 

Alexey Romanenko commented on BEAM-4798:


Yes, I did, thanks!

> IndexOutOfBoundsException when Flink parallelism > 1
> 
>
> Key: BEAM-4798
> URL: https://issues.apache.org/jira/browse/BEAM-4798
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Major
>
> Running a job on Flink in streaming mode and getting data from a Kafka topic 
> with parallelism > 1 causes an exception:
> {noformat}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>   at java.util.ArrayList.get(ArrayList.java:433)
>   at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:277)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
>   at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It happens when the number of Kafka topic partitions is less than the value 
> of parallelism (the number of task slots).
> So, a workaround for now is to set parallelism <= number of topic 
> partitions; thus, if parallelism=2, then number_partitions >= 2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4798) IndexOutOfBoundsException when Flink parallelism > 1

2018-07-19 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4798:
---
Affects Version/s: 2.4.0

> IndexOutOfBoundsException when Flink parallelism > 1
> 
>
> Key: BEAM-4798
> URL: https://issues.apache.org/jira/browse/BEAM-4798
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Major
>
> Running a job on Flink in streaming mode and getting data from a Kafka topic 
> with parallelism > 1 causes an exception:
> {noformat}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>   at java.util.ArrayList.get(ArrayList.java:433)
>   at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:277)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
>   at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It happens when the number of Kafka topic partitions is less than the value 
> of parallelism (the number of task slots).
> So, a workaround for now is to set parallelism <= number of topic 
> partitions; thus, if parallelism=2, then number_partitions >= 2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4803) Beam spark runner not working properly with kafka

2018-07-18 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547834#comment-16547834
 ] 

Alexey Romanenko commented on BEAM-4803:


[~rangadi] Sure, I'll take a look at this.
[~vivek_17] Could you check and confirm the same issue against Beam 2.5.0?

> Beam spark runner not working properly with kafka
> -
>
> Key: BEAM-4803
> URL: https://issues.apache.org/jira/browse/BEAM-4803
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
>Reporter: Vivek Agarwal
>Assignee: Alexey Romanenko
>Priority: Major
>
> We are running a Beam stream processing job on the Spark runner, which reads 
> from a Kafka topic using Kerberos authentication. We are using java-io-kafka 
> v2.4.0 to read from the Kafka topic in the pipeline. The issue is that the 
> KafkaIO client is continuously creating a new Kafka consumer with the 
> specified config, doing a Kerberos login every time. Also, Spark streaming 
> jobs get spawned for the unbounded source every second or so, even when 
> there is no data in the Kafka topic. The log has these jobs:
> INFO SparkContext: Starting job: dstr...@sparkunboumdedsource.java:172
> We can see in the logs:
> INFO MicrobatchSource: No cached reader found for split: 
> [org.apache.beam.sdk.io.kafka.KafkaUnboundedSource@2919a728]. Creating new 
> reader at checkpoint mark...
> It then creates a new consumer doing a fresh Kerberos login, which is 
> causing issues.
> We are unsure of what the correct behavior should be here and why so many 
> Spark streaming jobs are getting created. We tried the Beam code with the 
> Flink runner and did not find this issue there. Can someone point to the 
> correct settings for using an unbounded Kafka source with the Spark runner?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-4622) Many Beam SQL expressions never have their validation called

2018-07-17 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-4622.

   Resolution: Fixed
Fix Version/s: 2.6.0

> Many Beam SQL expressions never have their validation called
> 
>
> Key: BEAM-4622
> URL: https://issues.apache.org/jira/browse/BEAM-4622
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Alexey Romanenko
>Priority: Major
>  Labels: easyfix, newbie, starter
> Fix For: 2.6.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In {{BeamSqlFnExecutor}} there is a pattern where the returned expression is 
> first assigned to a variable {{ret}}, and then, after a giant switch 
> statement, the validation is invoked. But there are many code paths that just 
> call {{return}} and skip validation. This should be refactored so that it is 
> impossible to short-circuit by accident like this.
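A minimal sketch of the refactoring being asked for, using hypothetical names 
rather than the real {{BeamSqlFnExecutor}} types: the switch only assigns 
{{ret}}, and every path flows through a single validating return.

{code:java}
// Hypothetical names for illustration; not the actual BeamSqlFnExecutor code.
interface Expression {
  boolean accept(); // validation hook
}

class ExpressionBuilder {
  static Expression buildExpression(String opKind) {
    final Expression ret;
    switch (opKind) {
      case "+":
        ret = () -> true; // stands in for a concrete plus-expression
        break;
      case "-":
        ret = () -> true; // stands in for a concrete minus-expression
        break;
      default:
        throw new UnsupportedOperationException("Unsupported operator: " + opKind);
    }
    // Single exit point: no case can "return" directly, so validation
    // can no longer be skipped by accident.
    if (!ret.accept()) {
      throw new IllegalStateException("Invalid expression for: " + opKind);
    }
    return ret;
  }
}
{code}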



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4798) IndexOutOfBoundsException when Flink parallelism > 1

2018-07-16 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-4798:
--

 Summary: IndexOutOfBoundsException when Flink parallelism > 1
 Key: BEAM-4798
 URL: https://issues.apache.org/jira/browse/BEAM-4798
 Project: Beam
  Issue Type: Bug
  Components: runner-flink
Affects Versions: 2.5.0
Reporter: Alexey Romanenko
Assignee: Alexey Romanenko


Running a job on Flink in streaming mode and getting data from a Kafka topic 
with parallelism > 1 causes an exception:

{noformat}
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at 
org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:277)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
at 
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at 
org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
at java.lang.Thread.run(Thread.java:748)

{noformat}

It happens when the number of Kafka topic partitions is less than the value of 
parallelism (the number of task slots).
So, a workaround for now is to set parallelism <= number of topic partitions; 
thus, if parallelism=2, then number_partitions >= 2.
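A sketch of applying this workaround when building the pipeline, assuming the 
standard FlinkPipelineOptions; the value 2 is illustrative and should not 
exceed the topic's partition count:

{code:java}
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ParallelismWorkaround {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    // Workaround: keep parallelism <= number of Kafka topic partitions,
    // e.g. parallelism=2 for a topic with at least 2 partitions.
    options.setParallelism(2);

    Pipeline p = Pipeline.create(options);
    // ... KafkaIO read and the rest of the pipeline go here ...
    p.run().waitUntilFinish();
  }
}
{code}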



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-4622) Many Beam SQL expressions never have their validation called

2018-06-29 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528145#comment-16528145
 ] 

Alexey Romanenko edited comment on BEAM-4622 at 6/29/18 7:48 PM:
-

[~kenn] Can I take this one? Actually, I did a refactoring and some tests are 
failing because of this short-circuit. I'll try to fix it.


was (Author: aromanenko):
Can I take this one? Actually, I did a refactoring and some tests are failing 
because of this short-circuit. I'll try to fix it.

> Many Beam SQL expressions never have their validation called
> 
>
> Key: BEAM-4622
> URL: https://issues.apache.org/jira/browse/BEAM-4622
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: easyfix, newbie, starter
>
> In {{BeamSqlFnExecutor}} there is a pattern where the returned expression is 
> first assigned to a variable {{ret}}, and then, after a giant switch 
> statement, the validation is invoked. But there are many code paths that just 
> call {{return}} and skip validation. This should be refactored so that it is 
> impossible to short-circuit by accident like this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4622) Many Beam SQL expressions never have their validation called

2018-06-29 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528145#comment-16528145
 ] 

Alexey Romanenko commented on BEAM-4622:


Can I take this one? Actually, I did a refactoring and some tests are failing 
because of this short-circuit. I'll try to fix it.

> Many Beam SQL expressions never have their validation called
> 
>
> Key: BEAM-4622
> URL: https://issues.apache.org/jira/browse/BEAM-4622
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Priority: Major
>  Labels: easyfix, newbie, starter
>
> In {{BeamSqlFnExecutor}} there is a pattern where the returned expression is 
> first assigned to a variable {{ret}}, and then, after a giant switch 
> statement, the validation is invoked. But there are many code paths that just 
> call {{return}} and skip validation. This should be refactored so that it is 
> impossible to short-circuit by accident like this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4587) Test interoperability between Spark, Flink and Beam in terms of reading/writing Parquet files

2018-06-29 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527926#comment-16527926
 ] 

Alexey Romanenko commented on BEAM-4587:


Great to hear that it works fine. Good job, [~ŁukaszG]!

> Test interoperability between Spark, Flink and Beam in terms of 
> reading/writing Parquet files
> -
>
> Key: BEAM-4587
> URL: https://issues.apache.org/jira/browse/BEAM-4587
> Project: Beam
>  Issue Type: Task
>  Components: io-java-parquet
>Reporter: Łukasz Gajowy
>Assignee: Łukasz Gajowy
>Priority: Minor
> Fix For: 2.5.0
>
>
> Since ParquetIO is merged to master, we should test how it behaves with 
> parquet files created by native Spark and Flink applications. 
> More specifically, we should:
>  - test if files created by Flink/Spark can be read successfully using 
> ParquetIO in Beam
>  - test if files created using ParquetIO in Beam can be read using a native 
> Flink/Spark application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4632) KafkaIO seems to fail on streaming mode over spark runner

2018-06-26 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523937#comment-16523937
 ] 

Alexey Romanenko commented on BEAM-4632:


It seems that it works in streaming mode only when I run the pipeline with 
{{waitUntilFinish()}}. Otherwise, it just stops right after starting. So it's 
probably an issue with the SparkRunner; I'll investigate it.

[~Ricklin] Could you run your pipeline with {{p.run().waitUntilFinish();}} to 
check if it works for you?
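For reference, a minimal sketch of the difference, with the pipeline body 
omitted:

{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WaitExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... build the KafkaIO pipeline here ...

    // p.run() alone returns immediately; if main() then exits, the driver
    // process can take the streaming job down before it does any work.
    // Blocking on the result keeps the unbounded pipeline running:
    p.run().waitUntilFinish();
  }
}
{code}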

> KafkaIO seems to fail on streaming mode over spark runner
> -
>
> Key: BEAM-4632
> URL: https://issues.apache.org/jira/browse/BEAM-4632
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
> Environment: Ubuntu 16.04.4 LTS
>Reporter: Rick Lin
>Assignee: Alexey Romanenko
>Priority: Major
>
> Dear sir,
> The following versions of related tools are set in my running program:
> ==
> Beam 2.4.0 (Direct runner and Spark runner)
> Spark 2.2.1 (local mode and standalone mode)
> Kafka: 2.11-0.10.1.1
> scala: 2.11.8
> java: 1.8
> ==
> My programs (KafkaToKafka.java and StarterPipeline.java) are as shown on my 
> github: [https://github.com/LinRick/beamkafkaIO],
> The description of my situation is as follows:
> The Kafka broker is working, and KafkaIO.read (consumer) is used to capture 
> data from the assigned broker ip (http://ubuntu7:9092).
> The user manual of the KafkaIO SDK (on the 
> web: [https://beam.apache.org/documentation/sdks/javadoc/2.4.0/]) indicates 
> that the following parameters need to be set, and then KafkaIO can work 
> well:
> .withBootstrapServers("kafka broker ip:9092")
> .withTopic("kafkasink")
> .withKeyDeserializer(IntegerDeserializer.class)
> .withValueDeserializer(StringDeserializer.class)
> When I run my program with these settings over the direct runner, I find 
> that my program performs well; in addition, it runs in streaming mode. 
> *However, when I run the same code with the same settings (KafkaIO) over the 
> Spark runner, my program does not run in streaming mode and is shut down.* 
> Here, as mentioned on the website 
> [https://beam.apache.org/documentation/runners/spark/], the runner should 
> automatically set streaming mode; unfortunately, it failed for my program.
> On the other hand, if I set the parameter KafkaIO.read().withMaxNumRecords(1000) 
> or KafkaIO.read().withMaxReadTime(Duration), my program executes successfully 
> in batch mode (batch processing).
> The steps for performing StarterPipeline.java in my program are:
> step1 mvn compile exec:java -Dexec.mainClass=com.itri.beam.StarterPipeline 
> -Pspark2-runner -Dexec.args="--runner=SparkRunner"
> step2 mvn clean package
> step3 cp -rf target/beamkafkaIO-0.1.jar /root/
> step4 cd /spark-2.2.1-bin-hadoop2.6/bin
> step5 ./spark-submit --class com.itri.beam.StarterPipeline --master local[4] 
> /root/beamkafkaIO-0.1.jar --runner=SparkRunner
> I am not sure whether this issue is a bug in KafkaIO or whether some of my 
> parameter settings for the Spark runner are wrong.
> I really can't handle it, so I hope to get help from you.
> If any further information is needed, I am glad to provide it as soon as 
> possible.
> I will highly appreciate it if you can help me to deal with this issue.
> I am looking forward to hearing from you.
>  
> Sincerely yours,
>  
> Rick
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3246) BigtableIO should merge splits if they exceed 15K

2018-06-26 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523346#comment-16523346
 ] 

Alexey Romanenko commented on BEAM-3246:


Thank you!

> BigtableIO should merge splits if they exceed 15K
> -
>
> Key: BEAM-3246
> URL: https://issues.apache.org/jira/browse/BEAM-3246
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> A customer hit a problem with a large number of splits. CloudBigtableIO fixes 
> that here: 
> https://github.com/GoogleCloudPlatform/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java#L241
> BigtableIO should have similar logic.
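As a rough illustration of the merge logic meant here (not the actual 
BigtableIO or CloudBigtableIO code; Range stands in for the real row-range 
type), adjacent sorted splits can be combined pairwise until the count fits 
under the cap:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; "Range" stands in for Bigtable's row-range type.
class Range {
  final String start;
  final String end;
  Range(String start, String end) { this.start = start; this.end = end; }
}

class SplitMerger {
  static final int MAX_SPLITS = 15_000;

  // Merge adjacent (already sorted) ranges until at most MAX_SPLITS remain.
  static List<Range> merge(List<Range> sorted) {
    while (sorted.size() > MAX_SPLITS) {
      List<Range> merged = new ArrayList<>();
      for (int i = 0; i + 1 < sorted.size(); i += 2) {
        merged.add(new Range(sorted.get(i).start, sorted.get(i + 1).end));
      }
      if (sorted.size() % 2 == 1) {
        merged.add(sorted.get(sorted.size() - 1)); // carry the odd one over
      }
      sorted = merged;
    }
    return sorted;
  }
}
{code}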



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4632) kafkIO should be the streaming mode over spark runner

2018-06-25 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522491#comment-16522491
 ] 

Alexey Romanenko commented on BEAM-4632:


One remark - according to your [pom 
file|https://github.com/LinRick/beamkafkaIO/blob/1d65cc72c0e29ddf6080873507ab2db7d0cc8671/pom.xml#L35]
 you use Beam 2.5.0-SNAPSHOT

> kafkIO should be the streaming mode over spark runner
> -
>
> Key: BEAM-4632
> URL: https://issues.apache.org/jira/browse/BEAM-4632
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
> Environment: Ubuntu 16.04.4 LTS
>Reporter: Rick Lin
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 2.4.0
>
>
> Dear sir,
> The following versions of related tools are set in my running program:
> ==
> Beam 2.4.0 (Direct runner and Spark runner)
> Spark 2.2.1 (local mode and standalone mode)
> Kafka: 2.11-0.10.1.1
> scala: 2.11.8
> java: 1.8
> ==
> My programs (KafkaToKafka.java and StarterPipeline.java) are as shown on my 
> github: [https://github.com/LinRick/beamkafkaIO],
> The description of my situation is as follows:
> The Kafka broker is working, and KafkaIO.read (consumer) is used to capture 
> data from the assigned broker ip (http://ubuntu7:9092).
> The user manual of the KafkaIO SDK (on the 
> web: [https://beam.apache.org/documentation/sdks/javadoc/2.4.0/]) indicates 
> that the following parameters need to be set, and then KafkaIO can work 
> well:
> .withBootstrapServers("kafka broker ip:9092")
> .withTopic("kafkasink")
> .withKeyDeserializer(IntegerDeserializer.class)
> .withValueDeserializer(StringDeserializer.class)
> When I run my program with these settings over the direct runner, I find 
> that my program performs well; in addition, it runs in streaming mode. 
> *However, when I run the same code with the same settings (KafkaIO) over the 
> Spark runner, my program does not run in streaming mode and is shut down.* 
> Here, as mentioned on the website 
> [https://beam.apache.org/documentation/runners/spark/], the runner should 
> automatically set streaming mode; unfortunately, it failed for my program.
> On the other hand, if I set the parameter KafkaIO.read().withMaxNumRecords(1000) 
> or KafkaIO.read().withMaxReadTime(Duration), my program executes successfully 
> in batch mode (batch processing).
> The steps for performing StarterPipeline.java in my program are:
> step1 mvn compile exec:java -Dexec.mainClass=com.itri.beam.StarterPipeline 
> -Pspark2-runner -Dexec.args="--runner=SparkRunner"
> step2 mvn clean package
> step3 cp -rf target/beamkafkaIO-0.1.jar /root/
> step4 cd /spark-2.2.1-bin-hadoop2.6/bin
> step5 ./spark-submit --class com.itri.beam.StarterPipeline --master local[4] 
> /root/beamkafkaIO-0.1.jar --runner=SparkRunner
> I am not sure whether this issue is a bug in KafkaIO or whether some of my 
> parameter settings for the Spark runner are wrong.
> I really can't handle it, so I hope to get help from you.
> If any further information is needed, I am glad to provide it as soon as 
> possible.
> I will highly appreciate it if you can help me to deal with this issue.
> I am looking forward to hearing from you.
>  
> Sincerely yours,
>  
> Rick
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4632) kafkIO should be the streaming mode over spark runner

2018-06-25 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522486#comment-16522486
 ] 

Alexey Romanenko commented on BEAM-4632:


Hello Rick, thank you for the report. 

Do you see any error messages when you run your pipeline on Spark? Could you 
attach Spark logs to this Jira issue?

> kafkIO should be the streaming mode over spark runner
> -
>
> Key: BEAM-4632
> URL: https://issues.apache.org/jira/browse/BEAM-4632
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka, runner-spark
>Affects Versions: 2.4.0
> Environment: Ubuntu 16.04.4 LTS
>Reporter: Rick Lin
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 2.4.0
>
>
> Dear sir,
> The following versions of related tools are set in my running program:
> ==
> Beam 2.4.0 (Direct runner and Spark runner)
> Spark 2.2.1 (local mode and standalone mode)
> Kafka: 2.11-0.10.1.1
> scala: 2.11.8
> java: 1.8
> ==
> My programs (KafkaToKafka.java and StarterPipeline.java) are as shown on my 
> github: [https://github.com/LinRick/beamkafkaIO],
> The description of my situation is as follows:
> The Kafka broker is working, and KafkaIO.read (consumer) is used to capture 
> data from the assigned broker ip (http://ubuntu7:9092).
> The user manual of the KafkaIO SDK (on the 
> web: [https://beam.apache.org/documentation/sdks/javadoc/2.4.0/]) indicates 
> that the following parameters need to be set, and then KafkaIO can work 
> well:
> .withBootstrapServers("kafka broker ip:9092")
> .withTopic("kafkasink")
> .withKeyDeserializer(IntegerDeserializer.class)
> .withValueDeserializer(StringDeserializer.class)
> When I run my program with these settings over the direct runner, I find 
> that my program performs well; in addition, it runs in streaming mode. 
> *However, when I run the same code with the same settings (KafkaIO) over the 
> Spark runner, my program does not run in streaming mode and is shut down.* 
> Here, as mentioned on the website 
> [https://beam.apache.org/documentation/runners/spark/], the runner should 
> automatically set streaming mode; unfortunately, it failed for my program.
> On the other hand, if I set the parameter KafkaIO.read().withMaxNumRecords(1000) 
> or KafkaIO.read().withMaxReadTime(Duration), my program executes successfully 
> in batch mode (batch processing).
> The steps for performing StarterPipeline.java in my program are:
> step1 mvn compile exec:java -Dexec.mainClass=com.itri.beam.StarterPipeline 
> -Pspark2-runner -Dexec.args="--runner=SparkRunner"
> step2 mvn clean package
> step3 cp -rf target/beamkafkaIO-0.1.jar /root/
> step4 cd /spark-2.2.1-bin-hadoop2.6/bin
> step5 ./spark-submit --class com.itri.beam.StarterPipeline --master local[4] 
> /root/beamkafkaIO-0.1.jar --runner=SparkRunner
> I am not sure whether this issue is a bug in KafkaIO or whether some of my 
> parameter settings for the Spark runner are wrong.
> I really can't handle it, so I hope to get help from you.
> If any further information is needed, I am glad to provide it as soon as 
> possible.
> I will highly appreciate it if you can help me to deal with this issue.
> I am looking forward to hearing from you.
>  
> Sincerely yours,
>  
> Rick
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3246) BigtableIO should merge splits if they exceed 15K

2018-06-22 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520417#comment-16520417
 ] 

Alexey Romanenko commented on BEAM-3246:


Can we resolve this issue, since the PR has already been merged?

> BigtableIO should merge splits if they exceed 15K
> -
>
> Key: BEAM-3246
> URL: https://issues.apache.org/jira/browse/BEAM-3246
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> A customer hit a problem with a large number of splits. CloudBigtableIO fixes 
> that here: 
> https://github.com/GoogleCloudPlatform/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java#L241
> BigtableIO should have similar logic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4463) Add unit tests for S3ReadableSeekableByteChannel

2018-06-04 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4463:
---
Description: There are no tests for _S3ReadableSeekableByteChannel_; it 
can cause issues like BEAM-4421  (was: There is not tests for 
_S3ReadableSeekableByteChannel,_ it can cause the issues like BEAM-4421)

> Add unit tests for S3ReadableSeekableByteChannel
> 
>
> Key: BEAM-4463
> URL: https://issues.apache.org/jira/browse/BEAM-4463
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-aws
>Reporter: Alexey Romanenko
>Priority: Minor
>
> There are no tests for _S3ReadableSeekableByteChannel_; it can cause 
> issues like BEAM-4421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4463) Add unit tests for S3ReadableSeekableByteChannel

2018-06-04 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-4463:
--

 Summary: Add unit tests for S3ReadableSeekableByteChannel
 Key: BEAM-4463
 URL: https://issues.apache.org/jira/browse/BEAM-4463
 Project: Beam
  Issue Type: Improvement
  Components: io-java-aws
Reporter: Alexey Romanenko
Assignee: Ismaël Mejía


There are no tests for _S3ReadableSeekableByteChannel_; it can cause issues 
like BEAM-4421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-4421) Cannot read s3 files using ParquetIO

2018-06-01 Thread Alexey Romanenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-4421:
--

Assignee: Alexey Romanenko  (was: Łukasz Gajowy)

> Cannot read s3 files using ParquetIO
> 
>
> Key: BEAM-4421
> URL: https://issues.apache.org/jira/browse/BEAM-4421
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-parquet
>Reporter: Łukasz Gajowy
>Assignee: Alexey Romanenko
>Priority: Major
> Attachments: errorlog.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For s3 the read doesn't work and throws an IOException. Please see the 
> enclosed logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4397) Add details of using org.apache.beam.sdk.io.hdfs with different FS

2018-05-24 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4397:
---
Description: We need to add more details on how to configure 
"org.apache.beam.sdk.io.hdfs" to use it for different file systems (e.g. HDFS, 
S3, MapRFS, etc) like we have for HadoopInputFormat 
([https://beam.apache.org/documentation/io/built-in/hadoop/])  (was: We need to 
add more details how to configure HadoopFileSystem to use it for different file 
systems (e.g. HDFS, S3, MapRFS, etc) like we have for HadoopInputFormat 
([https://beam.apache.org/documentation/io/built-in/hadoop/]))

> Add details of using org.apache.beam.sdk.io.hdfs with different FS
> --
>
> Key: BEAM-4397
> URL: https://issues.apache.org/jira/browse/BEAM-4397
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>
> We need to add more details on how to configure "org.apache.beam.sdk.io.hdfs" to 
> use it for different file systems (e.g. HDFS, S3, MapRFS, etc) like we have 
> for HadoopInputFormat 
> ([https://beam.apache.org/documentation/io/built-in/hadoop/])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4397) Add details of using org.apache.beam.sdk.io.hdfs with different FS

2018-05-24 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4397:
---
Summary: Add details of using org.apache.beam.sdk.io.hdfs with different FS 
 (was: Add details of using HadoopFileSystem with different FS)

> Add details of using org.apache.beam.sdk.io.hdfs with different FS
> --
>
> Key: BEAM-4397
> URL: https://issues.apache.org/jira/browse/BEAM-4397
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>
> We need to add more details on how to configure HadoopFileSystem to use it for 
> different file systems (e.g. HDFS, S3, MapRFS, etc) like we have for 
> HadoopInputFormat 
> ([https://beam.apache.org/documentation/io/built-in/hadoop/])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4397) Add details of using HadoopFileSystem with different FS

2018-05-24 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4397:
---
Description: We need to add more details on how to configure HadoopFileSystem 
to use it for different file systems (e.g. HDFS, S3, MapRFS, etc) like we have 
for HadoopInputFormat 
([https://beam.apache.org/documentation/io/built-in/hadoop/])  (was: We need to 
add more details how to configure HadoopFileSystem to use it for different file 
systems, like HDFS, S3, MapRFS, etc, like we have for HadoopInputFormat 
([https://beam.apache.org/documentation/io/built-in/hadoop/]))

> Add details of using HadoopFileSystem with different FS
> ---
>
> Key: BEAM-4397
> URL: https://issues.apache.org/jira/browse/BEAM-4397
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>
> We need to add more details on how to configure HadoopFileSystem to use it for 
> different file systems (e.g. HDFS, S3, MapRFS, etc) like we have for 
> HadoopInputFormat 
> ([https://beam.apache.org/documentation/io/built-in/hadoop/])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4397) Add details of using HadoopFileSystem with different FS

2018-05-24 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4397:
---
Description: We need to add more details on how to configure HadoopFileSystem 
to use it for different file systems, like HDFS, S3, MapRFS, etc, like we have 
for HadoopInputFormat 
([https://beam.apache.org/documentation/io/built-in/hadoop/])  (was: We need to 
add more details how to configure HadoopFileSystem to use it for different file 
systems, like HDFS, S3, MapRFS, etc)

> Add details of using HadoopFileSystem with different FS
> ---
>
> Key: BEAM-4397
> URL: https://issues.apache.org/jira/browse/BEAM-4397
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
>
> We need to add more details on how to configure HadoopFileSystem to use it for 
> different file systems, like HDFS, S3, MapRFS, etc, like we have for 
> HadoopInputFormat 
> ([https://beam.apache.org/documentation/io/built-in/hadoop/])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4397) Add details of using HadoopFileSystem with different FS

2018-05-24 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-4397:
--

 Summary: Add details of using HadoopFileSystem with different FS
 Key: BEAM-4397
 URL: https://issues.apache.org/jira/browse/BEAM-4397
 Project: Beam
  Issue Type: Improvement
  Components: website
Reporter: Alexey Romanenko
Assignee: Alexey Romanenko


We need to add more details on how to configure HadoopFileSystem to use it for 
different file systems, like HDFS, S3, MapRFS, etc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4086) KafkaIOTest is flaky

2018-04-19 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443769#comment-16443769
 ] 

Alexey Romanenko commented on BEAM-4086:


[~rangadi] For me it didn't happen every time; I'd say 1 out of 4-5 runs.

> KafkaIOTest is flaky
> 
>
> Key: BEAM-4086
> URL: https://issues.apache.org/jira/browse/BEAM-4086
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kafka
>Affects Versions: 2.5.0
>Reporter: Ismaël Mejía
>Assignee: Raghu Angadi
>Priority: Major
>
> Noticed this while trying to do a simple change in KafkaIO this morning and 
> corroborated with other contributors. If you run `./gradlew -p 
> sdks/java/io/kafka/ clean build` it blocks indefinitely at least 1/3 of the 
> time. However, it passes OK with Maven.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-18 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442721#comment-16442721
 ] 

Alexey Romanenko edited comment on BEAM-3484 at 4/18/18 3:49 PM:
-

This issue can be easily reproduced by increasing the number of splits, which 
can be done with this config option:
{code:java}
conf.setInt("mapreduce.job.maps", 16);
{code}
The more splits we have, the higher the chance that we will hit this issue.

The root cause is the following: for every split, it creates and performs a 
separate SQL query, like:

{code:java}
SELECT id, name FROM TableName LIMIT Y OFFSET X
{code}

where X and Y depend on the split and the total number of rows in the table.

Since an RDBMS doesn't guarantee the order of results when "ORDER BY" is not 
used, the splits can intersect: some rows will be duplicated and others will 
be missed. To avoid this, the SQL query should order the results by one or 
several keys, like:

{code:java}
SELECT id, name FROM TableName ORDER BY id LIMIT Y OFFSET X 
{code}
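For reference, when the splits come from Hadoop's DBInputFormat, the ordering 
can be requested through the orderBy argument of DBInputFormat.setInput; a 
sketch under that assumption, with illustrative driver, table and column names:

{code:java}
import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OrderedSplitsSketch {
  // The DBWritable mapping class for the table is assumed to exist elsewhere.
  public static Job configure(Class<? extends DBWritable> tableWritable)
      throws IOException {
    Job job = Job.getInstance();
    DBConfiguration.configureDB(job.getConfiguration(),
        "org.postgresql.Driver", "jdbc:postgresql://host/db", "user", "password");
    // Passing an orderBy column ("id") makes each split's LIMIT/OFFSET query
    // deterministic, so splits neither overlap nor skip rows.
    DBInputFormat.setInput(job, tableWritable,
        "TableName", /* conditions */ null, /* orderBy */ "id", "id", "name");
    return job;
  }
}
{code}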



was (Author: aromanenko):
This issue can be easily reproduced by increasing the number of splits, which 
can be done with this config option:
{code:java}
conf.setInt("mapreduce.job.maps", 16);
{code}
The more splits we have, the higher the chance that we will hit this issue.

The root cause is the following: for every split, it creates and performs a 
separate SQL query, like:

{code:java}
SELECT id, name FROM TableName LIMIT Y OFFSET X
{code}

where X and Y depend on the split and the total number of rows in the table.

Since an RDBMS doesn't guarantee the order of results when "ORDER BY" is not 
used, the splits can intersect: some rows will be duplicated and others will 
be missed. To avoid this, the SQL query should order the results by one or 
several keys.

> HadoopInputFormatIO reads big datasets invalid
> --
>
> Key: BEAM-3484
> URL: https://issues.apache.org/jira/browse/BEAM-3484
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Łukasz Gajowy
>Assignee: Alexey Romanenko
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: result_sorted100, result_sorted60
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For big datasets HadoopInputFormat sometimes skips/duplicates elements from 
> database in resulting PCollection. This gives incorrect read result.
> Occurred to me while developing HadoopInputFormatIOIT and running it on 
> dataflow. For datasets smaller or equal to 600 000 database rows I wasn't 
> able to reproduce the issue. Bug appeared only for bigger sets, eg. 700 000, 
> 1 000 000. 
> Attachments:
>   - text file with sorted HadoopInputFormat.read() result saved using 
> TextIO.write().to().withoutSharding(). If you look carefully you'll notice 
> duplicates or missing values that should not happen
>  - same text file for 600 000 records not having any duplicates and missing 
> elements
>  - link to a PR with HadoopInputFormatIO integration test that allows to 
> reproduce this issue. At the moment of writing, this code is not merged yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-18 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442721#comment-16442721
 ] 

Alexey Romanenko commented on BEAM-3484:


This issue can be easily reproduced by increasing the number of splits, which 
can be done with this config option:
{code:java}
conf.setInt("mapreduce.job.maps", 16);
{code}
The more splits we have, the higher the chance that we will hit this issue.

The root cause is the following: for every split, it creates and performs a 
separate SQL query, like:

{code:java}
SELECT id, name FROM TableName LIMIT Y OFFSET X
{code}

where X and Y depend on the split and the total number of rows in the table.

Since an RDBMS doesn't guarantee the order of results when "ORDER BY" is not 
used, the splits can intersect: some rows will be duplicated and others will 
be missed. To avoid this, the SQL query should order the results by one or 
several keys.

> HadoopInputFormatIO reads big datasets invalid
> --
>
> Key: BEAM-3484
> URL: https://issues.apache.org/jira/browse/BEAM-3484
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Łukasz Gajowy
>Assignee: Alexey Romanenko
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: result_sorted100, result_sorted60
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For big datasets HadoopInputFormat sometimes skips/duplicates elements from 
> database in resulting PCollection. This gives incorrect read result.
> Occurred to me while developing HadoopInputFormatIOIT and running it on 
> dataflow. For datasets smaller or equal to 600 000 database rows I wasn't 
> able to reproduce the issue. Bug appeared only for bigger sets, eg. 700 000, 
> 1 000 000. 
> Attachments:
>   - text file with sorted HadoopInputFormat.read() result saved using 
> TextIO.write().to().withoutSharding(). If you look carefully you'll notice 
> duplicates or missing values that should not happen
>  - same text file for 600 000 records not having any duplicates and missing 
> elements
>  - link to a PR with HadoopInputFormatIO integration test that allows to 
> reproduce this issue. At the moment of writing, this code is not merged yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4086) KafkaIO is flaky with gradle build

2018-04-16 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439210#comment-16439210
 ] 

Alexey Romanenko commented on BEAM-4086:


It sometimes hangs with Maven too:

{code:java}
mvn clean verify -Prelease -pl sdks/java/io/kafka
{code}


> KafkaIO is flaky with gradle build
> --
>
> Key: BEAM-4086
> URL: https://issues.apache.org/jira/browse/BEAM-4086
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-kafka
>Affects Versions: 2.5.0
>Reporter: Ismaël Mejía
>Priority: Major
>
> Noticed this while trying to do a simple change in KafkaIO this morning and 
> corroborated with other contributors. If you run `./gradlew -p 
> sdks/java/io/kafka/ clean build` it blocks indefinitely at least 1/3 of the 
> time. However, it passes OK with Maven.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4066) Nexmark fails when running with PubSub as source/sink

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437551#comment-16437551
 ] 

Alexey Romanenko commented on BEAM-4066:


The root cause of this issue is that the anonymous inner DoFn classes are not 
serialisable, so we need to extract them. I'll provide a PR for that. 
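A minimal sketch of that kind of extraction, with hypothetical names rather 
than the actual NexmarkLauncher code:

{code:java}
import org.apache.beam.sdk.transforms.DoFn;

// Before (sketch): an anonymous DoFn declared inside a non-serializable
// enclosing class captures "this" and fails with NotSerializableException:
//
//   input.apply(ParDo.of(new DoFn<String, String>() {
//     @ProcessElement
//     public void process(ProcessContext c) { c.output(c.element()); }
//   }));

// After: extract it into a static nested (or top-level) class so it holds
// no reference to the enclosing launcher instance.
class PassThroughFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c) {
    c.output(c.element());
  }
}
// usage: input.apply(ParDo.of(new PassThroughFn()));
{code}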

> Nexmark fails when running with PubSub as source/sink
> -
>
> Key: BEAM-4066
> URL: https://issues.apache.org/jira/browse/BEAM-4066
> Project: Beam
>  Issue Type: Bug
>  Components: examples-nexmark
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
> Attachments: gistfile1.txt
>
>
> Running Nexmark with PubSub causes this exception:
> {noformat}
> Caused by: java.io.NotSerializableException: 
> org.apache.beam.sdk.nexmark.NexmarkLauncher
> {noformat}
> Full log is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-4066) Nexmark fails when running with PubSub as source/sink

2018-04-13 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-4066:
---
Attachment: gistfile1.txt

> Nexmark fails when running with PubSub as source/sink
> -
>
> Key: BEAM-4066
> URL: https://issues.apache.org/jira/browse/BEAM-4066
> Project: Beam
>  Issue Type: Bug
>  Components: examples-nexmark
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
> Attachments: gistfile1.txt
>
>
> Running Nexmark with PubSub causes this exception:
> {noformat}
> Caused by: java.io.NotSerializableException: 
> org.apache.beam.sdk.nexmark.NexmarkLauncher
> {noformat}
> Full log is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-4066) Nexmark fails when running with PubSub as source/sink

2018-04-13 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-4066:
--

 Summary: Nexmark fails when running with PubSub as source/sink
 Key: BEAM-4066
 URL: https://issues.apache.org/jira/browse/BEAM-4066
 Project: Beam
  Issue Type: Bug
  Components: examples-nexmark
Reporter: Alexey Romanenko
Assignee: Alexey Romanenko


Running Nexmark with PubSub causes this exception:

{noformat}
Caused by: java.io.NotSerializableException: 
org.apache.beam.sdk.nexmark.NexmarkLauncher
{noformat}

Full log is attached.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437513#comment-16437513
 ] 

Alexey Romanenko commented on BEAM-3485:


[~adejanovski] 
 1. BEAM-3424 I agree with what you suggested as a split strategy. My only 
concern is, as in the original cause of the StackOverflow question, that if a 
user runs a pipeline from a local machine and the Cassandra instance is located 
in a different network, then we can't estimate the number of splits reliably. 
So, for this case, we could perhaps provide an option to set the number of 
splits manually, though Beam doesn't welcome additional tuning knobs that are 
not strictly necessary. Do you think this could be another solution? Default 
options?
 2. BEAM-3425 I'm just curious whether _Long_ was not enough for that?
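If the option from point 1 were added, usage could look roughly like this 
(withMinNumberOfSplits is a hypothetical knob here, and Scientist an 
illustrative entity class, not a confirmed API):

{code:java}
// Hypothetical sketch only: a user-facing override for when size estimation
// is unreliable (e.g. pipeline built locally, cluster in another network).
pipeline.apply(CassandraIO.<Scientist>read()
    .withHosts(Arrays.asList("cassandra-host"))
    .withPort(9042)
    .withKeyspace("beam_ks")
    .withTable("scientist")
    .withEntity(Scientist.class)
    .withCoder(SerializableCoder.of(Scientist.class))
    .withMinNumberOfSplits(100)); // manual split count instead of estimation
{code}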

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437332#comment-16437332
 ] 

Alexey Romanenko commented on BEAM-3485:


[~adejanovski] Could you run this command locally? 
{{./gradlew -p sdks/java/io/cassandra build}}
I think it was caused by the unused field 
CassandraServiceImpl$CassandraReaderImpl.partitioner

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437245#comment-16437245
 ] 

Alexey Romanenko edited comment on BEAM-3485 at 4/13/18 12:23 PM:
--

[~adejanovski] Thank you for working on this. I assume that your PR fixes the 
other two related issues as well - BEAM-3424 and BEAM-3425. Is that correct?


was (Author: aromanenko):
[~adejanovski] Thank you for working on this. I assume that your PR fixes the 
other two issues as well - BEAM-3424 and BEAM-3425. Is that correct?

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437245#comment-16437245
 ] 

Alexey Romanenko commented on BEAM-3485:


[~adejanovski] Thank you for working on this. I assume that your PR fixes the 
other two issues as well - BEAM-3424 and BEAM-3425. Is that correct?

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3425) CassandraIO fails to estimate size: Codec not found for requested operation: [varchar <-> java.lang.Long]

2018-04-10 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3425:
--

Assignee: Alexey Romanenko  (was: Jean-Baptiste Onofré)

> CassandraIO fails to estimate size: Codec not found for requested operation: 
> [varchar <-> java.lang.Long]
> -
>
> Key: BEAM-3425
> URL: https://issues.apache.org/jira/browse/BEAM-3425
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See exception in 
> https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264#48131264
>  .
> The exception comes from 
> https://github.com/apache/beam/blob/master/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraServiceImpl.java#L279
>  , where I suppose "range_start" and "range_end" are really varchar, but the 
> code expects them to be long.
> Indeed they are varchar: 
> https://github.com/apache/cassandra/blob/4c80eeece37d79f434078224a0504400ae10a20d/src/java/org/apache/cassandra/db/SystemKeyspace.java#L238
>  and have been for at least the past 3 years.
> However really they seem to be storing longs: 
> https://github.com/apache/cassandra/blob/95b43b195e4074533100f863344c182a118a8b6c/src/java/org/apache/cassandra/hadoop/cql3/CqlInputFormat.java#L229
> So I guess all that needs to be fixed is adding a Long.parseLong.
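A sketch of the suggested one-line fix, assuming the driver's Row accessor for 
the varchar columns:

{code:java}
// system.size_estimates stores range_start/range_end as varchar, so read
// them as strings and parse, instead of asking the driver for a long:
long rangeStart = Long.parseLong(row.getString("range_start"));
long rangeEnd = Long.parseLong(row.getString("range_end"));
{code}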



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-10 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3485:
--

Assignee: Alexey Romanenko  (was: Jean-Baptiste Onofré)

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3424) CassandraIO uses 1 split if can't estimate size

2018-04-10 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3424:
--

Assignee: Alexey Romanenko  (was: Jean-Baptiste Onofré)

> CassandraIO uses 1 split if can't estimate size
> ---
>
> Key: BEAM-3424
> URL: https://issues.apache.org/jira/browse/BEAM-3424
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>
> See 
> https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra?noredirect=1#comment83227824_48090668
>  . When CassandraIO can't estimate size, it falls back to a single split:
> https://github.com/apache/beam/blob/master/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraServiceImpl.java#L196
> A single split is very poor for performance. We should fall back to a 
> different value. Not sure what a good value would be; probably the largest 
> value that still doesn't introduce too much per-split overhead? E.g. would 
> there be any downside to just changing that number to 100?
> Alternatively/additionally, like in DatastoreIO, CassandraIO could accept 
> requested number of splits as a parameter.
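
A sketch of the suggested fallback (the method and the default value of 100 
are hypothetical, taken from the discussion above, not existing 
CassandraServiceImpl code):

{code:java}
// Fall back to a fixed split count when the size estimate is unavailable,
// and honor an explicit user-requested split count when one is given.
static final int DEFAULT_NUM_SPLITS = 100;

static int numSplits(long desiredBundleSizeBytes, long estimatedSizeBytes, Integer requestedNumSplits) {
  if (requestedNumSplits != null) {
    return requestedNumSplits; // explicit parameter, as in DatastoreIO
  }
  if (estimatedSizeBytes <= 0) {
    return DEFAULT_NUM_SPLITS; // size unknown: avoid a single-split fallback
  }
  return (int) Math.max(1, estimatedSizeBytes / desiredBundleSizeBytes);
}
{code}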



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-2852) Add support for Kafka as source/sink on Nexmark

2018-04-05 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-2852:
--

Assignee: Alexey Romanenko  (was: Kai Jiang)

> Add support for Kafka as source/sink on Nexmark
> ---
>
> Key: BEAM-2852
> URL: https://issues.apache.org/jira/browse/BEAM-2852
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Ismaël Mejía
>Assignee: Alexey Romanenko
>Priority: Minor
>  Labels: newbie, nexmark, starter
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-2852) Add support for Kafka as source/sink on Nexmark

2018-04-05 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427272#comment-16427272
 ] 

Alexey Romanenko commented on BEAM-2852:


I'll take this one since there has been no update for a while. I closed the 
old PR and created a new one, keeping the authorship of [~vectorijk] on the 
original commits (I only squashed them into one).
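
For context, wiring Kafka in as a Nexmark source could look roughly like the 
sketch below with the existing KafkaIO (topic name and bootstrap servers are 
placeholders; the actual PR may differ):

{code:java}
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.LongDeserializer;

// Read raw event bytes from a Kafka topic and drop the keys.
PCollection<byte[]> events =
    pipeline
        .apply(KafkaIO.<Long, byte[]>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("nexmark-events")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(ByteArrayDeserializer.class)
            .withoutMetadata())
        .apply(Values.create());
{code}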

> Add support for Kafka as source/sink on Nexmark
> ---
>
> Key: BEAM-2852
> URL: https://issues.apache.org/jira/browse/BEAM-2852
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Ismaël Mejía
>Assignee: Kai Jiang
>Priority: Minor
>  Labels: newbie, nexmark, starter
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3946) Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set

2018-04-05 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426717#comment-16426717
 ] 

Alexey Romanenko commented on BEAM-3946:


Thank you!

> Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set
> -
>
> Key: BEAM-3946
> URL: https://issues.apache.org/jira/browse/BEAM-3946
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python
>Reporter: Alexey Romanenko
>Assignee: Mark Liu
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Running {{mvn clean install}} locally fails on the following Apache Beam :: 
> SDKs :: Python tests:
> {{ERROR: test_message_matcher_mismatch 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  {{ERROR: test_message_matcher_success 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  {{ERROR: test_message_metcher_timeout 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  
> with an error:
> DefaultCredentialsError: Could not automatically determine credentials. 
> Please set GOOGLE_APPLICATION_CREDENTIALS or
>  explicitly create credential and re-run the application. For more
>  information, please see
>  
> [https://developers.google.com/accounts/docs/application-default-credentials].
>   >> begin captured logging << 
>  google.auth.transport._http_client: DEBUG: Making request: GET 
> [http://169.254.169.254|http://169.254.169.254/]
>  google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
> unavailable.
>  - >> end captured logging << -
>  
> It looks like a regression caused by this commit: 
> [301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-04-05 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426686#comment-16426686
 ] 

Alexey Romanenko edited comment on BEAM-3726 at 4/5/18 9:55 AM:


Fixed by BEAM-3087.
Thank you to [~pawelbartoszek] for reporting this and helping to reproduce it.


was (Author: aromanenko):
Fixed by BEAM-3087


> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Alexey Romanenko
>Priority: Major
> Fix For: 2.5.0
>
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  
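
From the captured state, the failure can be reproduced in isolation: the 
function's internal clock is already at 1519315585000, and {{add()}} is then 
invoked with the earlier timestamp 1519315583591. A minimal sketch (the 
constructor argument order and the 60s sample period are assumptions based on 
how KinesisReader builds its MovingFunctions):

{code:java}
import org.apache.beam.sdk.transforms.Min;
import org.apache.beam.sdk.util.MovingFunction;

MovingFunction fn = new MovingFunction(
    60000 /* samplePeriodMs: 12 buckets x 5000ms, matching the dump above */,
    5000 /* sampleUpdateMs */,
    2 /* numSignificantBuckets */,
    10 /* numSignificantSamples */,
    Min.ofLongs());
fn.add(1519315585000L, 42L); // advances currentMsSinceEpoch to 1519315585000
fn.add(1519315583591L, 42L); // earlier timestamp -> "Attempting to move backwards"
{code}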



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-04-05 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426686#comment-16426686
 ] 

Alexey Romanenko edited comment on BEAM-3726 at 4/5/18 9:53 AM:


Fixed by BEAM-3087



was (Author: aromanenko):
Fixed by [BEAM-3087|https://issues.apache.org/jira/browse/BEAM-3087]


> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Alexey Romanenko
>Priority: Major
> Fix For: 2.5.0
>
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-04-05 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko resolved BEAM-3726.

   Resolution: Fixed
Fix Version/s: 2.5.0

Fixed by [BEAM-3087|https://issues.apache.org/jira/browse/BEAM-3087]


> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Alexey Romanenko
>Priority: Major
> Fix For: 2.5.0
>
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3819) Add withRequestRecordsLimit option to KinesisIO

2018-03-28 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3819:
---
Summary: Add withRequestRecordsLimit option to KinesisIO  (was: Add 
withLimit() option to KinesisIO)

> Add withRequestRecordsLimit option to KinesisIO
> ---
>
> Key: BEAM-3819
> URL: https://issues.apache.org/jira/browse/BEAM-3819
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-kinesis
>Reporter: Jean-Baptiste Onofré
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In some cases, the user might need to set the {{limit}} on the 
> {{SimplifiedKinesisClient}}, especially for performance reasons, depending 
> on the number of records.
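
A usage sketch, assuming the option lands on {{KinesisIO.Read}} as 
{{withRequestRecordsLimit(int)}} (stream name and credentials below are 
placeholders):

{code:java}
pipeline.apply("KINESIS READER", KinesisIO.read()
    .withStreamName("my-stream")
    .withInitialPositionInStream(InitialPositionInStream.LATEST)
    .withAWSClientsProvider("accessKey", "secretKey", EU_WEST_1)
    // Cap the number of records fetched per GetRecords request.
    .withRequestRecordsLimit(1000));
{code}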



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-03-27 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415183#comment-16415183
 ] 

Alexey Romanenko edited comment on BEAM-3726 at 3/27/18 3:16 PM:
-

[~pawelbartoszek] The issue BEAM-3087 was fixed. Could you check whether it 
fixes the current issue as well?


was (Author: aromanenko):
[~pawelbartoszek] The issue BEAM-3087 was fixed. Could you check that this also 
fixes current issue?

> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Alexey Romanenko
>Priority: Major
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3946) Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set

2018-03-27 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3946:
---
Description: 
Running {{mvn clean install}} locally fails on the following Apache Beam :: 
SDKs :: Python tests:

{{ERROR: test_message_matcher_mismatch 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_matcher_success 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_metcher_timeout 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}

 

with an error:

DefaultCredentialsError: Could not automatically determine credentials. Please 
set GOOGLE_APPLICATION_CREDENTIALS or
 explicitly create credential and re-run the application. For more
 information, please see
 [https://developers.google.com/accounts/docs/application-default-credentials].
  >> begin captured logging << 
 google.auth.transport._http_client: DEBUG: Making request: GET 
[http://169.254.169.254|http://169.254.169.254/]
 google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
unavailable.
 - >> end captured logging << -

 

It looks like a regression caused by this commit: 
[301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]

  was:
Running locally mvn clean install fails on following Apache Beam :: SDKs :: 
Python tests:

{{ERROR: test_message_matcher_mismatch 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_matcher_success 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_metcher_timeout 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}

 

with an error:

{{DefaultCredentialsError: Could not automatically determine credentials. 
Please set GOOGLE_APPLICATION_CREDENTIALS or
explicitly create credential and re-run the application. For more
information, please see
https://developers.google.com/accounts/docs/application-default-credentials.
 >> begin captured logging << 
google.auth.transport._http_client: DEBUG: Making request: GET 
http://169.254.169.254
google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
unavailable.
- >> end captured logging << -}}

 

It looks like it's a regression and it was caused by this commit: 
[301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]


> Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set
> -
>
> Key: BEAM-3946
> URL: https://issues.apache.org/jira/browse/BEAM-3946
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python
>Reporter: Alexey Romanenko
>Assignee: Ahmet Altay
>Priority: Major
>
> Running {{mvn clean install}} locally fails on the following Apache Beam :: 
> SDKs :: Python tests:
> {{ERROR: test_message_matcher_mismatch 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  {{ERROR: test_message_matcher_success 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  {{ERROR: test_message_metcher_timeout 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  
> with an error:
> DefaultCredentialsError: Could not automatically determine credentials. 
> Please set GOOGLE_APPLICATION_CREDENTIALS or
>  explicitly create credential and re-run the application. For more
>  information, please see
>  
> [https://developers.google.com/accounts/docs/application-default-credentials].
>   >> begin captured logging << 
>  google.auth.transport._http_client: DEBUG: Making request: GET 
> [http://169.254.169.254|http://169.254.169.254/]
>  google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
> unavailable.
>  - >> end captured logging << -
>  
> It looks like a regression caused by this commit: 
> [301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3946) Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set

2018-03-27 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3946:
---
Description: 
Running {{mvn clean install}} locally fails on the following Apache Beam :: 
SDKs :: Python tests:

{{ERROR: test_message_matcher_mismatch 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_matcher_success 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_metcher_timeout 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}

 

with an error:

{{DefaultCredentialsError: Could not automatically determine credentials. 
Please set GOOGLE_APPLICATION_CREDENTIALS or
explicitly create credential and re-run the application. For more
information, please see
https://developers.google.com/accounts/docs/application-default-credentials.
 >> begin captured logging << 
google.auth.transport._http_client: DEBUG: Making request: GET 
http://169.254.169.254
google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
unavailable.
- >> end captured logging << -}}

 

It looks like a regression caused by this commit: 
[301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]

  was:
Running locally mvn clean install fails on following Apache Beam :: SDKs :: 
Python tests:

{{ERROR: test_message_matcher_mismatch 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_matcher_success 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_metcher_timeout 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}

 

{{with an error:}}
{{ DefaultCredentialsError: Could not automatically determine credentials. 
Please set GOOGLE_APPLICATION_CREDENTIALS or}}
{{ explicitly create credential and re-run the application. For more}}
{{ information, please see}}
{{ 
[https://developers.google.com/accounts/docs/application-default-credentials].}}
{{  >> begin captured logging << }}
{{ google.auth.transport._http_client: DEBUG: Making request: GET 
[http://169.254.169.254|http://169.254.169.254/]}}
{{ google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
unavailable.}}
{{ - >> end captured logging << -}}

 

It looks like it's a regression and it was caused by this commit: 
[301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]


> Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set
> -
>
> Key: BEAM-3946
> URL: https://issues.apache.org/jira/browse/BEAM-3946
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python
>Reporter: Alexey Romanenko
>Assignee: Ahmet Altay
>Priority: Major
>
> Running {{mvn clean install}} locally fails on the following Apache Beam :: 
> SDKs :: Python tests:
> {{ERROR: test_message_matcher_mismatch 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  {{ERROR: test_message_matcher_success 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  {{ERROR: test_message_metcher_timeout 
> (apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
>  
> with an error:
> {{DefaultCredentialsError: Could not automatically determine credentials. 
> Please set GOOGLE_APPLICATION_CREDENTIALS or
> explicitly create credential and re-run the application. For more
> information, please see
> https://developers.google.com/accounts/docs/application-default-credentials.
>  >> begin captured logging << 
> google.auth.transport._http_client: DEBUG: Making request: GET 
> http://169.254.169.254
> google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
> unavailable.
> - >> end captured logging << -}}
>  
> It looks like a regression caused by this commit: 
> [301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-3946) Python SDK tests are failing if no GOOGLE_APPLICATION_CREDENTIALS was set

2018-03-27 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-3946:
--

 Summary: Python SDK tests are failing if no 
GOOGLE_APPLICATION_CREDENTIALS was set
 Key: BEAM-3946
 URL: https://issues.apache.org/jira/browse/BEAM-3946
 Project: Beam
  Issue Type: Bug
  Components: examples-python
Reporter: Alexey Romanenko
Assignee: Ahmet Altay


Running {{mvn clean install}} locally fails on the following Apache Beam :: 
SDKs :: Python tests:

{{ERROR: test_message_matcher_mismatch 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_matcher_success 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}
 {{ERROR: test_message_metcher_timeout 
(apache_beam.io.gcp.tests.pubsub_matcher_test.PubSubMatcherTest)}}

 

{{with an error:}}
{{ DefaultCredentialsError: Could not automatically determine credentials. 
Please set GOOGLE_APPLICATION_CREDENTIALS or}}
{{ explicitly create credential and re-run the application. For more}}
{{ information, please see}}
{{ 
[https://developers.google.com/accounts/docs/application-default-credentials].}}
{{  >> begin captured logging << }}
{{ google.auth.transport._http_client: DEBUG: Making request: GET 
[http://169.254.169.254|http://169.254.169.254/]}}
{{ google.auth.compute_engine._metadata: INFO: Compute Engine Metadata server 
unavailable.}}
{{ - >> end captured logging << -}}

 

It looks like a regression caused by this commit: 
[301853647f2c726c04c5bdb02cab6ff6b39f09d0|https://github.com/apache/beam/commit/301853647f2c726c04c5bdb02cab6ff6b39f09d0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-03-27 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415183#comment-16415183
 ] 

Alexey Romanenko commented on BEAM-3726:


[~pawelbartoszek] The issue BEAM-3087 was fixed. Could you check whether this 
also fixes the current issue?

> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Alexey Romanenko
>Priority: Major
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-03-27 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3726:
--

Assignee: Alexey Romanenko  (was: Jean-Baptiste Onofré)

> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Alexey Romanenko
>Priority: Major
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-03-27 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko updated BEAM-3726:
---
Issue Type: Bug  (was: Improvement)

> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Jean-Baptiste Onofré
>Priority: Major
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from savepoint Kinesis Reader throws almost always 
> {{java.lang.IllegalArgumentException: Attempting to move backwards}}
> After a few job restarts caused again by the same exception, job finally 
> starts up and continues to run with no further problems.
> Beam job is reading from 32 shards with parallelism set to 32. Using Flink 
> 1.3.2. But I have seen this exception also when using Beam 2.2 when Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally I managed to catch this exception. Just before executing 
> this 
> [link|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw exception I captured the state of the class so that you can 
> replicate the issue
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> the add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3881) Failure reading backlog in KinesisIO

2018-03-26 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413452#comment-16413452
 ] 

Alexey Romanenko commented on BEAM-3881:


[~kvncp] Great to hear that it was already fixed and that the fix helped you. 
Thank you for checking this!

> Failure reading backlog in KinesisIO
> 
>
> Key: BEAM-3881
> URL: https://issues.apache.org/jira/browse/BEAM-3881
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Reporter: Kevin Peterson
>Assignee: Alexey Romanenko
>Priority: Major
> Fix For: 2.4.0
>
>
> I'm getting an error when reading from Kinesis in my pipeline. Using Beam 
> v2.3, running on Google Cloud Dataflow.
> I'm constructing the source via:
> {code:java}
> KinesisIO.Read read = KinesisIO
> .read()
> .withAWSClientsProvider(
> configuration.getAwsAccessKeyId(),
> configuration.getAwsSecretAccessKey(),
> region)
> .withStreamName(configuration.getKinesisStream())
> .withUpToDateThreshold(Duration.standardMinutes(30))
> .withInitialTimestampInStream(configuration.getStartTime());
> {code}
> The exception is:
> {noformat}
> Mar 19, 2018 12:54:41 PM 
> org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler process
> SEVERE: 2018-03-19T19:54:53.010Z: (2896b8774de760ec): 
> java.lang.RuntimeException: Unknown kinesis failure, when trying to reach 
> kinesis
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:223)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:161)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:150)
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:200)
> com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:398)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1199)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:137)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:940)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: Value cannot fit in an int: 
> 153748225435
> org.joda.time.field.FieldUtils.safeToInt(FieldUtils.java:206)
> org.joda.time.field.BaseDurationField.getDifference(BaseDurationField.java:141)
> org.joda.time.base.BaseSingleFieldPeriod.between(BaseSingleFieldPeriod.java:72)
> org.joda.time.Minutes.minutesBetween(Minutes.java:101)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.lambda$getBacklogBytes$3(SimplifiedKinesisClient.java:163)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:205)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:161)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:150)
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:200)
> com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:398)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1199)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:137)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:940)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745){noformat}
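
The root cause is visible in the last frames: Joda-Time computes the 
difference as a long count of minutes and then narrows it to int, so any 
sufficiently large interval overflows. A minimal sketch that reproduces the 
same exception with the value from the stack trace:

{code:java}
import org.joda.time.Instant;
import org.joda.time.Minutes;

Instant countSince = new Instant(0L);
Instant now = new Instant(153748225435L * 60000L); // 153748225435 minutes later
// Throws java.lang.ArithmeticException: Value cannot fit in an int: 153748225435
Minutes.minutesBetween(countSince, now);
{code}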



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-3912) Add support of HadoopOutputFormatIO

2018-03-22 Thread Alexey Romanenko (JIRA)
Alexey Romanenko created BEAM-3912:
--

 Summary: Add support of HadoopOutputFormatIO
 Key: BEAM-3912
 URL: https://issues.apache.org/jira/browse/BEAM-3912
 Project: Beam
  Issue Type: Improvement
  Components: io-java-hadoop
Reporter: Alexey Romanenko
Assignee: Alexey Romanenko


At the moment, there is only HadoopInputFormatIO in Beam. To support writing 
through IOs that are not yet natively covered in Beam (for example, Apache ORC 
or HBase bulk load), it would make sense to add a HadoopOutputFormatIO as well.
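
As a purely hypothetical sketch of what such a transform could look like 
(neither {{HadoopOutputFormatIO.write()}} nor {{withConfiguration}} exists 
yet; only the Hadoop configuration keys are real), mirroring how 
HadoopInputFormatIO is configured:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
// Standard Hadoop output configuration: which OutputFormat to use, where to
// write, and the key/value types it expects.
conf.setClass("mapreduce.job.outputformat.class", TextOutputFormat.class, OutputFormat.class);
conf.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/beam-output");
conf.setClass("mapreduce.job.output.key.class", Text.class, Object.class);
conf.setClass("mapreduce.job.output.value.class", LongWritable.class, Object.class);

records.apply(HadoopOutputFormatIO.<Text, LongWritable>write()
    .withConfiguration(conf));
{code}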



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3881) Failure reading backlog in KinesisIO

2018-03-21 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407937#comment-16407937
 ] 

Alexey Romanenko commented on BEAM-3881:


Unfortunately, I can't reproduce this issue using the same pipeline.

[~kvncp] could you provide more details, such as:
 # Is it consistently reproducible on your side, or does it happen only from 
time to time?
 # Does it happen right at the beginning of your pipeline run or after a 
while? Did you manage to read any messages before it occurred?
 # How many shards does your Kinesis stream have?
 # What is the actual return value of {{configuration.getStartTime()}}?
 # It would be very helpful if you could provide a code snippet that reliably 
reproduces this issue.

> Failure reading backlog in KinesisIO
> 
>
> Key: BEAM-3881
> URL: https://issues.apache.org/jira/browse/BEAM-3881
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Reporter: Kevin Peterson
>Assignee: Alexey Romanenko
>Priority: Major
>
> I'm getting an error when reading from Kinesis in my pipeline. Using Beam 
> v2.3, running on Google Cloud Dataflow.
> I'm constructing the source via:
> {code:java}
> KinesisIO.Read read = KinesisIO
> .read()
> .withAWSClientsProvider(
> configuration.getAwsAccessKeyId(),
> configuration.getAwsSecretAccessKey(),
> region)
> .withStreamName(configuration.getKinesisStream())
> .withUpToDateThreshold(Duration.standardMinutes(30))
> .withInitialTimestampInStream(configuration.getStartTime());
> {code}
> The exception is:
> {noformat}
> Mar 19, 2018 12:54:41 PM 
> org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler process
> SEVERE: 2018-03-19T19:54:53.010Z: (2896b8774de760ec): 
> java.lang.RuntimeException: Unknown kinesis failure, when trying to reach 
> kinesis
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:223)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:161)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:150)
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:200)
> com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:398)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1199)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:137)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:940)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: Value cannot fit in an int: 
> 153748225435
> org.joda.time.field.FieldUtils.safeToInt(FieldUtils.java:206)
> org.joda.time.field.BaseDurationField.getDifference(BaseDurationField.java:141)
> org.joda.time.base.BaseSingleFieldPeriod.between(BaseSingleFieldPeriod.java:72)
> org.joda.time.Minutes.minutesBetween(Minutes.java:101)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.lambda$getBacklogBytes$3(SimplifiedKinesisClient.java:163)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:205)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:161)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:150)
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:200)
> com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:398)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1199)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:137)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:940)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3881) Failure reading backlog in KinesisIO

2018-03-20 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3881:
--

Assignee: Alexey Romanenko  (was: Jean-Baptiste Onofré)

> Failure reading backlog in KinesisIO
> 
>
> Key: BEAM-3881
> URL: https://issues.apache.org/jira/browse/BEAM-3881
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Reporter: Kevin Peterson
>Assignee: Alexey Romanenko
>Priority: Major
>
> I'm getting an error when reading from Kinesis in my pipeline. Using Beam 
> v2.3, running on Google Cloud Dataflow.
> I'm constructing the source via:
> {code:java}
> KinesisIO.Read read = KinesisIO
> .read()
> .withAWSClientsProvider(
> configuration.getAwsAccessKeyId(),
> configuration.getAwsSecretAccessKey(),
> region)
> .withStreamName(configuration.getKinesisStream())
> .withUpToDateThreshold(Duration.standardMinutes(30))
> .withInitialTimestampInStream(configuration.getStartTime());
> {code}
> The exception is:
> {noformat}
> Mar 19, 2018 12:54:41 PM 
> org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler process
> SEVERE: 2018-03-19T19:54:53.010Z: (2896b8774de760ec): 
> java.lang.RuntimeException: Unknown kinesis failure, when trying to reach 
> kinesis
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:223)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:161)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:150)
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:200)
> com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:398)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1199)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:137)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:940)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: Value cannot fit in an int: 
> 153748225435
> org.joda.time.field.FieldUtils.safeToInt(FieldUtils.java:206)
> org.joda.time.field.BaseDurationField.getDifference(BaseDurationField.java:141)
> org.joda.time.base.BaseSingleFieldPeriod.between(BaseSingleFieldPeriod.java:72)
> org.joda.time.Minutes.minutesBetween(Minutes.java:101)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.lambda$getBacklogBytes$3(SimplifiedKinesisClient.java:163)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:205)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:161)
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:150)
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:200)
> com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:398)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1199)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:137)
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:940)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3726) Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move backwards

2018-03-15 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400152#comment-16400152
 ] 

Alexey Romanenko commented on BEAM-3726:


It looks like this issue is related to this one: BEAM-3087

> Kinesis Reader: java.lang.IllegalArgumentException: Attempting to move 
> backwards
> 
>
> Key: BEAM-3726
> URL: https://issues.apache.org/jira/browse/BEAM-3726
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-kinesis
>Affects Versions: 2.2.0
>Reporter: Pawel Bartoszek
>Assignee: Jean-Baptiste Onofré
>Priority: Major
> Attachments: KinesisIO-state.txt
>
>
> When the job is restored from a savepoint, the Kinesis Reader almost always 
> throws {{java.lang.IllegalArgumentException: Attempting to move backwards}}.
> After a few job restarts, each caused by the same exception, the job finally 
> starts up and continues to run with no further problems.
> The Beam job is reading from 32 shards with parallelism set to 32, using Flink 
> 1.3.2. I have also seen this exception with Beam 2.2, where the Kinesis 
> client was refactored to use MovingFunction. I think this is a serious 
> regression bug introduced in Beam 2.2. 
>  
> {code:java}
> java.lang.IllegalArgumentException: Attempting to move backwards
> at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.beam.sdk.util.MovingFunction.flush(MovingFunction.java:97)
> at org.apache.beam.sdk.util.MovingFunction.add(MovingFunction.java:114)
> at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.advance(KinesisReader.java:137)
> at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeAdvance(ReaderInvocationUtil.java:67)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:264)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
> at 
> org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:39)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702){code}
>  
> Kinesis Reader transformation configuration:
> {code:java}
> pipeline.apply("KINESIS READER", KinesisIO.read()
> .withStreamName(streamName)
> .withInitialPositionInStream(InitialPositionInStream.LATEST)
> .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)){code}
>  
> When testing locally, I managed to catch this exception. Just before executing 
> the 
> [line|https://github.com/apache/beam/blob/6c93105c2cb7be709c6b3e2e6cdcd09df2b48308/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MovingFunction.java#L97]
>  that threw the exception, I captured the state of the class so that you can 
> replicate the issue:
> {code:java}
> org.apache.beam.sdk.util.MovingFunction@71781a[sampleUpdateMs=5000,numSignificantBuckets=2,numSignificantSamples=10,function=org.apache.beam.sdk.transforms.Min$MinLongFn@7909d8d3,buckets={9223372036854775807,9223372036854775807,1519315344334,1519315343759,1519315343770,1519315344086,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807,9223372036854775807},numSamples={0,0,1,158,156,146,0,0,0,0,144,0},currentMsSinceEpoch=1519315585000,currentIndex=2]{code}
>  
> The add function of MovingFunction was called with nowMsSinceEpoch = 
> 1519315583591.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-3087) Extend lock scope in Flink UnboundedSourceWrapper

2018-03-14 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398841#comment-16398841
 ] 

Alexey Romanenko edited comment on BEAM-3087 at 3/14/18 4:46 PM:
-

Is anyone working on this? 

It looks like it's related to this issue: 
[BEAM-3726|https://issues.apache.org/jira/browse/BEAM-3726], especially to 
Pawel's 
[comment|https://issues.apache.org/jira/browse/BEAM-3726?focusedCommentId=16383882&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16383882].
 

This call (actually, two calls inside this method) should be synchronized with 
other reader calls. Otherwise, we could potentially have a race condition there.
{code}
public class UnboundedSourceWrapper
...
@Override
  public void run(SourceContext<WindowedValue<ValueWithRecordId<OutputT>>> ctx) 
throws Exception {
  ...
while (isRunning) {
dataAvailable = readerInvoker.invokeAdvance(reader);
...
{code}

Does this assumption make sense?
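
A rough sketch of what extending the lock scope could look like, using Flink's 
per-source checkpoint lock (illustrative only, not an actual patch; field and 
method names are taken from the snippet above):

{code}
@Override
public void run(SourceContext<WindowedValue<ValueWithRecordId<OutputT>>> ctx)
    throws Exception {
  final Object checkpointLock = ctx.getCheckpointLock();
  while (isRunning) {
    boolean dataAvailable;
    // Hold the checkpoint lock across advance() and emission so that no
    // checkpoint (or other reader call) interleaves with the Source, which
    // does not allow concurrent calls.
    synchronized (checkpointLock) {
      dataAvailable = readerInvoker.invokeAdvance(reader);
      if (dataAvailable) {
        emitElement(ctx, reader);
      }
    }
  }
}
{code}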
 

 


was (Author: aromanenko):
Is anyone working on this? 

It looks like it's related to this issue: 
[BEAM-3726|https://issues.apache.org/jira/browse/BEAM-3726], especially to 
Pawel's 
[comment|https://issues.apache.org/jira/browse/BEAM-3726?focusedCommentId=16383882&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16383882].
 

This call (actually, two calls inside this method) should be synchronized with 
other reader calls:
{code}
public class UnboundedSourceWrapper
...
@Override
  public void run(SourceContext<WindowedValue<ValueWithRecordId<OutputT>>> ctx) 
throws Exception {
  ...
while (isRunning) {
dataAvailable = readerInvoker.invokeAdvance(reader);
...
{code}

Does this assumption make sense?
 

 

> Extend lock scope in Flink UnboundedSourceWrapper
> -
>
> Key: BEAM-3087
> URL: https://issues.apache.org/jira/browse/BEAM-3087
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Priority: Major
>
> In {{UnboundedSourceWrapper}} the lock scope is not big enough: we 
> synchronise in {{emitElement()}} but should instead synchronise inside the 
> reader loop in {{run()}} because the {{Source}} interface does not allow 
> concurrent calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3087) Extend lock scope in Flink UnboundedSourceWrapper

2018-03-14 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398841#comment-16398841
 ] 

Alexey Romanenko commented on BEAM-3087:


Is anyone working on this? 

It looks like it's related to this issue: 
[BEAM-3726|https://issues.apache.org/jira/browse/BEAM-3726], especially to 
Pawel's 
[comment|https://issues.apache.org/jira/browse/BEAM-3726?focusedCommentId=16383882&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16383882].
 

This call (actually, two calls inside this method) should be synchronized with 
other reader calls:
{code}
public class UnboundedSourceWrapper
...
@Override
  public void run(SourceContext<WindowedValue<ValueWithRecordId<OutputT>>> ctx) 
throws Exception {
  ...
while (isRunning) {
dataAvailable = readerInvoker.invokeAdvance(reader);
...
{code}

Does this assumption make sense?
 

 

> Extend lock scope in Flink UnboundedSourceWrapper
> -
>
> Key: BEAM-3087
> URL: https://issues.apache.org/jira/browse/BEAM-3087
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Priority: Major
>
> In {{UnboundedSourceWrapper}} the lock scope is not big enough: we 
> synchronise in {{emitElement()}} but should instead synchronise inside the 
> reader loop in {{run()}} because the {{Source}} interface does not allow 
> concurrent calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-3819) Add withLimit() option to KinesisIO

2018-03-13 Thread Alexey Romanenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Romanenko reassigned BEAM-3819:
--

Assignee: Alexey Romanenko  (was: Jean-Baptiste Onofré)

> Add withLimit() option to KinesisIO
> ---
>
> Key: BEAM-3819
> URL: https://issues.apache.org/jira/browse/BEAM-3819
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-kinesis
>Reporter: Jean-Baptiste Onofré
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In some cases, the user might need to set the {{limit}} on the 
> {{SimplifiedKinesisClient}}, especially for performance reasons, depending on 
> the number of records.
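
A hypothetical sketch of how such an option might look to users (the issue only 
names it {{withLimit()}}; the method and value below are illustrative, not the 
final API):

{code:java}
KinesisIO.Read read = KinesisIO
    .read()
    .withStreamName("my-stream")
    .withInitialPositionInStream(InitialPositionInStream.LATEST)
    .withAWSClientsProvider(awsAccessKey, awsSecretKey, EU_WEST_1)
    .withLimit(1000); // cap the number of records fetched per GetRecords call
{code}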



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3605) Kinesis ShardReadersPoolTest shouldForgetClosedShardIterator failure

2018-02-20 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370144#comment-16370144
 ] 

Alexey Romanenko commented on BEAM-3605:


Finally, it was fixed by using `Mockito.timeout`
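
For reference, {{Mockito.timeout}} turns an immediate (and therefore racy) 
verification into one that polls until a deadline, which is what de-flakes 
asynchronous read-loop tests like this one. A minimal self-contained sketch 
(test and mock names illustrative, not the actual ShardReadersPoolTest):

{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

import java.util.concurrent.Executors;
import org.junit.Test;

public class TimeoutVerificationSketch {
  @Test
  public void waitsForAsyncInteraction() {
    Runnable listener = mock(Runnable.class);

    // The interaction happens on a background thread, just like the
    // shard read loop in ShardReadersPool.
    Executors.newSingleThreadExecutor().submit(listener::run);

    // Polls for up to 1s instead of asserting immediately, so the
    // verification no longer races the background thread.
    verify(listener, timeout(1000)).run();
  }
}
{code}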

> Kinesis ShardReadersPoolTest shouldForgetClosedShardIterator failure
> 
>
> Key: BEAM-3605
> URL: https://issues.apache.org/jira/browse/BEAM-3605
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Reporter: Kenneth Knowles
>Assignee: Paweł Kaczmarczyk
>Priority: Critical
>  Labels: flake, sickbay
> Fix For: 2.4.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Here's one:
> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/1758/testReport/junit/org.apache.beam.sdk.io.kinesis/ShardReadersPoolTest/shouldForgetClosedShardIterator/
> Filing all test failures as "Critical" so we can sickbay or fix.
> The Jenkins build will get GC'd so here is the error:
> {code}
> java.lang.AssertionError: 
> Expecting:
>   <["shard1", "shard2"]>
> to contain only:
>   <["shard2"]>
> but the following elements were unexpected:
>   <["shard1"]>
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPoolTest.shouldForgetClosedShardIterator(ShardReadersPoolTest.java:270)
> {code}
> and stderr
> {code}
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:19 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException
> Feb 01, 2018 11:24:19 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException: Shard iterator 
> reached end of the shard: streamName=null, shardId=shard1
>   at 
> org.apache.beam.sdk.io.kinesis.ShardRecordsIterator.readNextBatch(ShardRecordsIterator.java:70)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.readLoop(ShardReadersPool.java:121)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.lambda$startReadingShards$0(ShardReadersPool.java:112)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard2 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException: Shard iterator 
> reached end of the shard: streamName=null, shardId=shard2
>   at 
> org.apache.beam.sdk.io.kinesis.ShardRecordsIterator.readNextBatch(ShardRecordsIterator.java:70)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.readLoop(ShardReadersPool.java:121)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.lambda$startReadingShards$0(ShardReadersPool.java:112)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool stop
> INFO: Closing shard iterators pool
> Feb 01, 2018 11:24:24 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop 

[jira] [Comment Edited] (BEAM-3605) Kinesis ShardReadersPoolTest shouldForgetClosedShardIterator failure

2018-02-20 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370144#comment-16370144
 ] 

Alexey Romanenko edited comment on BEAM-3605 at 2/20/18 3:32 PM:
-

Finally, it was fixed by using _Mockito.timeout_


was (Author: aromanenko):
Finally, it was fixed by using `Mockito.timeout`

> Kinesis ShardReadersPoolTest shouldForgetClosedShardIterator failure
> 
>
> Key: BEAM-3605
> URL: https://issues.apache.org/jira/browse/BEAM-3605
> Project: Beam
>  Issue Type: Bug
>  Components: z-do-not-use-sdk-java-extensions
>Reporter: Kenneth Knowles
>Assignee: Paweł Kaczmarczyk
>Priority: Critical
>  Labels: flake, sickbay
> Fix For: 2.4.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Here's one:
> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/1758/testReport/junit/org.apache.beam.sdk.io.kinesis/ShardReadersPoolTest/shouldForgetClosedShardIterator/
> Filing all test failures as "Critical" so we can sickbay or fix.
> The Jenkins build will get GC'd so here is the error:
> {code}
> java.lang.AssertionError: 
> Expecting:
>   <["shard1", "shard2"]>
> to contain only:
>   <["shard2"]>
> but the following elements were unexpected:
>   <["shard1"]>
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPoolTest.shouldForgetClosedShardIterator(ShardReadersPoolTest.java:270)
> {code}
> and stderr
> {code}
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException
> Feb 01, 2018 11:24:16 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:19 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException
> Feb 01, 2018 11:24:19 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard1 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException: Shard iterator 
> reached end of the shard: streamName=null, shardId=shard1
>   at 
> org.apache.beam.sdk.io.kinesis.ShardRecordsIterator.readNextBatch(ShardRecordsIterator.java:70)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.readLoop(ShardReadersPool.java:121)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.lambda$startReadingShards$0(ShardReadersPool.java:112)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Kinesis Shard read loop has finished
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool 
> readLoop
> INFO: Shard iterator for shard2 shard is closed, finishing the read loop
> org.apache.beam.sdk.io.kinesis.KinesisShardClosedException: Shard iterator 
> reached end of the shard: streamName=null, shardId=shard2
>   at 
> org.apache.beam.sdk.io.kinesis.ShardRecordsIterator.readNextBatch(ShardRecordsIterator.java:70)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.readLoop(ShardReadersPool.java:121)
>   at 
> org.apache.beam.sdk.io.kinesis.ShardReadersPool.lambda$startReadingShards$0(ShardReadersPool.java:112)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Feb 01, 2018 11:24:23 PM org.apache.beam.sdk.io.kinesis.ShardReadersPool stop
> INFO: Closing shard 
