Re: Command line to run DatastoreIO integration tests for Java

2021-08-10 Thread Miguel Anzo Palomo
Hi Alex,
My problem trying to run that test locally was that, since I don't have
access to the apache-beam-testing GCP project, I had trouble getting a
Datastore instance to read/write from.

On Tue, Aug 10, 2021 at 1:08 PM Alex Amato  wrote:

> Miguel, I believe you were having trouble running this test, you mentioned
> that you needed a Datastore instance.
> Would you mind sharing the errors or any issue you had when you tried Ke's
> command?
>
> Ke, Do we need to specify a datastore instance to run this test? Where
> will it read/write from? I assume it will default to something in
> the apache-beam-testing GCP project, as other tests do.
>
>
> On Thu, Jul 29, 2021 at 6:16 PM Ke Wu  wrote:
>
>> “Run SQL PostCommit” is essentially running “./gradlew :sqlPostCommit” [1]
>>
>> In your case where you would like to run DataStoreReadWriteIT only, it
>> can be simplified [2] to
>>
>> “./gradlew :sdks:java:extensions:sql:postCommit”
>>
>> Best,
>> Ke
>>
>> [1]
>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PostCommit_SQL.groovy#L41
>>
>> [2] https://github.com/apache/beam/blob/master/build.gradle.kts#L190
>>
>>
>> On Jul 29, 2021, at 9:58 AM, Alex Amato  wrote:
>>
>> I was hoping for the command line to run it, so that the test could be
>> tweaked to inject an error and ensure the error handling code works as
>> expected.
>>
>> On Wed, Jul 28, 2021 at 8:34 PM Ke Wu  wrote:
>>
>>> Commenting “Run SQL PostCommit” on the PR triggers the post-commit
>>> integration tests for SQL, which I believe include DataStoreReadWriteIT.
>>>
>>> Let me know whether or not this is sufficient.
>>>
>>> Best,
>>> Ke
>>>
>>> On Jul 28, 2021, at 12:20 PM, Alex Amato  wrote:
>>>
>>> Is it possible to run a Datastore IO integration test to test this PR?
>>>
>>> https://github.com/apache/beam/pull/15183/files
>>>
>>> This test can probably be run somehow, though I don't know the Gradle
>>> command to run it.
>>>
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/datastore/DataStoreReadWriteIT.java
>>>
>>> Does anyone know how to run this test?
>>>
>>>
>>>
>>

-- 

Miguel Angel Anzo Palomo | WIZELINE

Software Engineer

miguel.a...@wizeline.com

Remote Office



Re: [Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
>
> It is likely that the incorrect transform was edited...
>

It appears you're right; I tried to reproduce the problem, and this time I
was able to clear the issue by making "the same" code change and updating
the pipeline. I believe the earlier attempt was simply a change in the wrong
place in the code.

Good to know this works! Thanks Luke 

On Tue, Aug 10, 2021 at 1:19 PM Luke Cwik  wrote:

>
>
> On Tue, Aug 10, 2021 at 10:11 AM Evan Galpin 
> wrote:
>
>> Thanks for your responses Luke. One point I have confusion over:
>>
>> * Modify the sink implementation to do what you want with the bad data
>>> and update the pipeline.
>>>
>>
>>  I modified the sink implementation to ignore the specific error that was
>> the problem and updated the pipeline. The update succeeded.  However, the
>> problem persisted.  How might that happen?  Is there caching involved?
>> Checkpointing? I changed the very last method called in the pipeline in
>> order to ensure the validation would apply, but the problem persisted.
>>
>
> It is likely that the incorrect transform was edited. You should take a
> look at the worker logs and find the exception that is being thrown and
> find the step name it is associated with (e.g.
> BigQueryIO/Write/StreamingInserts) and browse the source for a
> "StreamingInserts" transform that is applied from the "Write" transform
> which is applied from the BigQueryIO transform.
>
>
>>
>> And in the case where one is using a Sink which is a built-in IO module
>> of Beam, modification of the Sink may not be immediately feasible. Is the
>> only recourse in that case to drain the job and start a new one?
>>
>> The Beam IOs are open source allowing you to edit the code and build a
> new version locally which you would consume in your project. Dataflow does
> have an optimization and replaces the pubsub source/sink but all others to
> my knowledge should be based upon the Apache Beam source.
>
>
>> On Tue, Aug 10, 2021 at 12:54 PM Luke Cwik  wrote:
>>
>>>
>>>
>>> On Tue, Aug 10, 2021 at 8:54 AM Evan Galpin 
>>> wrote:
>>>
 Hi all,

 I recently had an experience where a streaming pipeline became
 "clogged" due to invalid data reaching the final step in my pipeline such
 that the data was causing a non-transient error when writing to my Sink.
 Since the job is a streaming job, the element (bundle) was continuously
 retrying.

 What options are there for getting out of this state when it occurs?

>>>
>>> * Modify the sink implementation to do what you want with the bad data
>>> and update the pipeline.
>>> * Cancel the pipeline and update the implementation to handle the bad
>>> records and rerun from last known good position reprocessing whatever was
>>> necessary.
>>>
>>> I attempted to add validation and update the streaming job to remove the
 bad entity; though the update was successful, I believe the bad entity was
 already checkpointed (?) further downstream in the pipeline. What then?

>>> And for something like a database schema and evolving it over time, what
 is the typical solution?

>>>
>>> * Pipeline update containing newly updated schema before data with new
>>> schema starts rolling in.
>>> * Use a format and encoding that is agnostic to changes with a
>>> source/sink that can consume this agnostic format. See this thread[1] and
>>> others like it in the user and dev mailing lists.
>>>
>>>
 - Should pipelines mirror a DB schema and do validation of all data
 types in the pipeline?

>>>
>>> Perform validation at critical points in the pipeline like data ingress
>>> and egress. Insertion of the data into the DB failing via a dead letter
>>> queue works for the cases that are loud and throw exceptions but for the
>>> cases where they are inserted successfully but are still invalid from a
>>> business logic standpoint won't be caught without validation.
>>>
>>>
 - Should all sinks implement a way to remove non-transient failures
 from retrying and output them via PCollectionTuple (such as with BigQuery
 failed inserts)?

>>>
>>> Yes, dead letter queues (DLQs) are quite a good solution for this since
>>> it provides a lot of flexibility and allows for a process to fix it up
>>> (typically a manual process).
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/r4b31c8b76fa81dcb130397077b981ab6429f2999b9d864c815214c5a%40%3Cuser.beam.apache.org%3E
>>>
>>
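
The dead-letter pattern Luke describes can be sketched without any Beam
dependencies. In a real pipeline this logic would live inside a `DoFn` using
`TupleTag`s, with the failed output read from the resulting `PCollectionTuple`;
the `writeWithDlq` helper and the toy validating sink below are purely
illustrative, not code from this thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DlqSketch {
  // Stand-in for a PCollectionTuple with a main and a dead-letter output.
  static final class Outputs {
    final List<String> ok = new ArrayList<>();
    final List<String> failed = new ArrayList<>();
  }

  // Mirrors a DoFn.processElement body: catch the non-transient error and
  // route the element to the dead-letter output instead of letting the
  // bundle retry forever.
  static Outputs writeWithDlq(List<String> input, Consumer<String> sink) {
    Outputs out = new Outputs();
    for (String record : input) {
      try {
        sink.accept(record); // the write that may fail non-transiently
        out.ok.add(record);
      } catch (IllegalArgumentException e) {
        out.failed.add(record + " | " + e.getMessage());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // A toy "sink" that rejects empty records with a non-transient error.
    Consumer<String> sink = r -> {
      if (r.isEmpty()) {
        throw new IllegalArgumentException("empty record");
      }
    };
    Outputs out = writeWithDlq(List.of("a", "", "b"), sink);
    System.out.println(out.ok);     // [a, b]
    System.out.println(out.failed); // [ | empty record]
  }
}
```

In Beam proper, the failed output would be a tagged `PCollection` written to
durable storage for manual inspection and replay, analogous to BigQueryIO's
failed-inserts output mentioned above.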


Re: Contributor permission for beam Jira tickets

2021-08-10 Thread Chris Hinds
Thanks!

On 10 Aug 2021, at 18:21, Luke Cwik <lc...@google.com> wrote:

Welcome.

I have granted you the access you requested.

Please take a look at the contribution guide[1] as it helps answer many 
questions.

1: https://beam.apache.org/contribute/

On Tue, Aug 10, 2021 at 10:06 AM Chris Hinds <chris.hi...@bdi.ox.ac.uk> wrote:
Hello,

I’ve been dipping in and out of Beam as a user for a while. I came across an 
issue and I’d like to assign it to myself (username: chrishinds).

Thanks!
Chris.



Flaky test issue report (32)

2021-08-10 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12387: beam_PostCommit_Python* 
timing out (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12386: 
beam_PostCommit_Py_VR_Dataflow(_V2) failing metrics tests (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12322: 
FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12320: 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL 
PostCommit (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12309: 
PubSubIntegrationTest.test_streaming_data_only flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12307: 
PubSubBigQueryIT.test_file_loads flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12303: Flake in 
PubSubIntegrationTest.test_streaming_with_attributes (created 2021-05-06)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-11837: Java build flakes: 
"Memory constraints are impeding performance" (created 2021-02-18)
https://issues.apache.org/jira/browse/BEAM-11792: Python precommit failed 
(flaked?) installing package  (created 2021-02-10)
https://issues.apache.org/jira/browse/BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (created 2021-01-20)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11645: beam_PostCommit_XVR_Flink 
failing (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11641: Bigquery Read tests are 
flaky on Flink runner in Python PostCommit suites (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11541: 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner. 
(created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-10955: Flink Java Runner test 
flake: Could not find Flink job  (created 2020-09-23)
https://issues.apache.org/jira/browse/BEAM-10866: 
PortableRunnerTestWithSubprocesses.test_register_finalizations flaky on macOS 
(created 2020-09-09)
https://issues.apache.org/jira/browse/BEAM-10485: Failure / flake: 
ElasticsearchIOTest > testWriteWithIndexFn (created 2020-07-14)
https://issues.apache.org/jira/browse/BEAM-9649: 
beam_python_mongoio_load_test started failing due to mismatched results 
(created 2020-03-31)
https://issues.apache.org/jira/browse/BEAM-9232: 
BigQueryWriteIntegrationTests is flaky coercing to Unicode (created 2020-01-31)
https://issues.apache.org/jira/browse/BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (created 2019-08-27)
https://issues.apache.org/jira/browse/BEAM-8035: 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in timestamp 
order (sickbayed) (created 2019-08-22)
https://issues.apache.org/jira/browse/BEAM-7992: Unhandled type_constraint 
in 
apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
 (created 2019-08-16)
https://issues.apache.org/jira/browse/BEAM-7827: 
MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on 
DirectRunner (created 2019-07-26)
https://issues.apache.org/jira/browse/BEAM-7752: Java Validates 
DirectRunner: testTeardownCalledAfterExceptionInFinishBundleStateful flaky 
(created 2019-07-16)
https://issues.apache.org/jira/browse/BEAM-6804: [beam_PostCommit_Java] 
[PubsubReadIT.testReadPublicData] Timeout waiting on Sub (created 2019-03-11)
https://issues.apache.org/jira/browse/BEAM-5286: 
[beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
 .sh script: text file busy. (created 2018-09-01)

Re: Protobuf schema provider row functions break on CamelCased field names

2021-08-10 Thread Chris Hinds
I’ve made a bad job of the PR; I’m really sorry. This is my proposed commit:

https://github.com/apache/beam/pull/15306/commits/0f5324321aea01ea035a6e9c46dc80faf1b078f5


On 10 Aug 2021, at 18:15, Reuven Lax <re...@google.com> wrote:

Definitely happy to look at a PR.

On Tue, Aug 10, 2021 at 10:11 AM Chris Hinds <chris.hi...@bdi.ox.ac.uk> wrote:
I created an issue for this: https://issues.apache.org/jira/browse/BEAM-12736

I also took a stab at a fix. Would you accept a pull request? Or, I'd be happy 
to discuss.

Cheers,
Chris.


On 9 Aug 2021, at 21:02, Chris Hinds <chris.hi...@bdi.ox.ac.uk> wrote:

Haha, it probably shouldn’t!

I’ll have a look.

C.

On 9 Aug 2021, at 19:43, Reuven Lax <re...@google.com> wrote:


I didn't even realize that the proto compiler allowed camel-case field names! I 
think the only option would be for someone to add support for camel-case proto 
field names in Beam.

On Mon, Aug 9, 2021 at 10:57 AM Chris Hinds <chris.hi...@bdi.ox.ac.uk> wrote:
Hi,

I get an IllegalArgumentException when I call a row function against a proto 
instance.

SerializableFunction<MyDataModel.ProtoPayload, Row> myRowFunction =
    new ProtoMessageSchema().toRowFunction(new TypeDescriptor<MyDataModel.ProtoPayload>() {});
MyDataModel.ProtoPayload payload = …
Row row = myRowFunction.apply(payload);

It looks like there is some field name mapping down in 
ProtoByteBuddyUtils.getProtoGetter() and protoGetterName(...) which assumes 
proto field names are snake_case, when it tries to map row schema names into 
proto getter names.

I appreciate proto best practice is to have snake_case field names. Sadly, my 
proto field names are lowerCamel, and the above lookup fails.

Short of renaming every Proto definition, is there anything I can do?

Cheers,
Chris.
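
The failure mechanism can be illustrated with a small sketch. The conversion
below approximates what a snake_case-to-UpperCamel mapping (e.g. Guava's
`CaseFormat.LOWER_UNDERSCORE.to(UPPER_CAMEL, ...)`) does when building getter
names from schema field names; it is an illustration of the mismatch, not the
actual Beam code:

```java
public class GetterNameSketch {
  // Approximation of a snake_case -> UpperCamel conversion: split on '_',
  // upper-case the first letter of each word, lower-case the rest.
  // (Assumed behavior for illustration; not copied from Beam internals.)
  static String snakeToUpperCamel(String fieldName) {
    StringBuilder sb = new StringBuilder();
    for (String word : fieldName.split("_")) {
      if (word.isEmpty()) {
        continue;
      }
      sb.append(Character.toUpperCase(word.charAt(0)));
      sb.append(word.substring(1).toLowerCase());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // snake_case declaration: mapping agrees with the getter protoc generates.
    System.out.println("get" + snakeToUpperCamel("my_field")); // getMyField
    // lowerCamel declaration: protoc generates getMyField, but the mapping
    // flattens the inner capital, so the reflective getter lookup misses.
    System.out.println("get" + snakeToUpperCamel("myField"));  // getMyfield
  }
}
```

With a snake_case field name both sides arrive at `getMyField`, but with a
lowerCamel name the mapping produces `getMyfield` while protoc generated
`getMyField`, which would surface as a failed getter lookup like the
`IllegalArgumentException` reported in this thread.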




P1 issues report (42)

2021-08-10 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-12712: Java Portable VR 
PostCommits failing - ParDoTest TimerTests.testEventTimeTimerLoop (created 
2021-08-02)
https://issues.apache.org/jira/browse/BEAM-12699: Several streaming tests 
failing - PostCommit_Java_VR_Dataflow_V2_Streaming, Python PreCommit,  (created 
2021-07-30)
https://issues.apache.org/jira/browse/BEAM-12683: PythonPostCommits 
RecommendationAIIT.test_create_catalog_item failing (created 2021-07-29)
https://issues.apache.org/jira/browse/BEAM-12678: 
beam_PreCommit_GoPortable_Phrase failing to start the local job server (created 
2021-07-29)
https://issues.apache.org/jira/browse/BEAM-12632: ElasticsearchIO: Enabling 
both User/Pass auth and SSL overwrites User/Pass (created 2021-07-16)
https://issues.apache.org/jira/browse/BEAM-12607: Copy Code Snippet copies 
html tags (created 2021-07-13)
https://issues.apache.org/jira/browse/BEAM-12603: Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
 (created 2021-07-12)
https://issues.apache.org/jira/browse/BEAM-12525: SDF BoundedSource seems 
to execute significantly slower than 'normal' BoundedSource (created 2021-06-22)
https://issues.apache.org/jira/browse/BEAM-12500: Dataflow SocketException 
(SSLException) error while trying to send message from Cloud Pub/Sub to 
BigQuery (created 2021-06-16)
https://issues.apache.org/jira/browse/BEAM-12484: JdbcIO date conversion is 
sensitive to OS (created 2021-06-14)
https://issues.apache.org/jira/browse/BEAM-12467: 
java.io.InvalidClassException With Flink Kafka (created 2021-06-09)
https://issues.apache.org/jira/browse/BEAM-12436: 
[beam_PostCommit_Go_VR_flink| beam_PostCommit_Go_VR_spark] 
[:sdks:go:test:flinkValidatesRunner] Failure summary (created 2021-06-01)
https://issues.apache.org/jira/browse/BEAM-12380: Go SDK Kafka IO Transform 
implemented via XLang (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12374: Spark postcommit failing 
ResumeFromCheckpointStreamingTest (created 2021-05-20)
https://issues.apache.org/jira/browse/BEAM-12310: 
beam_PostCommit_Java_DataflowV2 failing (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12279: Implement 
destination-dependent sharding in FileIO.writeDynamic (created 2021-05-04)
https://issues.apache.org/jira/browse/BEAM-12256: 
PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some 
Avro logical types (created 2021-04-29)
https://issues.apache.org/jira/browse/BEAM-11959: Python Beam SDK Harness 
hangs when installing pip packages (created 2021-03-11)
https://issues.apache.org/jira/browse/BEAM-11906: No trigger early 
repeatedly for session windows (created 2021-03-01)
https://issues.apache.org/jira/browse/BEAM-11875: XmlIO.Read does not 
handle XML encoding per spec (created 2021-02-26)
https://issues.apache.org/jira/browse/BEAM-11828: JmsIO is not 
acknowledging messages correctly (created 2021-02-17)
https://issues.apache.org/jira/browse/BEAM-11755: Cross-language 
consistency (RequiresStableInputs) is quietly broken (at least on portable 
flink runner) (created 2021-02-05)
https://issues.apache.org/jira/browse/BEAM-11578: `dataflow_metrics` 
(python) fails with TypeError (when int overflowing?) (created 2021-01-06)
https://issues.apache.org/jira/browse/BEAM-11148: Kafka 
commitOffsetsInFinalize OOM on Flink (created 2020-10-28)
https://issues.apache.org/jira/browse/BEAM-11017: Timer with dataflow 
runner can be set multiple times (dataflow runner) (created 2020-10-05)
https://issues.apache.org/jira/browse/BEAM-10670: Make non-portable 
Splittable DoFn the only option when executing Java "Read" transforms (created 
2020-08-10)
https://issues.apache.org/jira/browse/BEAM-10617: python 
CombineGlobally().with_fanout() cause duplicate combine results for sliding 
windows (created 2020-07-31)
https://issues.apache.org/jira/browse/BEAM-10569: SpannerIO tests don't 
actually assert anything. (created 2020-07-23)
https://issues.apache.org/jira/browse/BEAM-10529: Kafka XLang fails for 
?empty? key/values (created 2020-07-18)
https://issues.apache.org/jira/browse/BEAM-10288: Quickstart documents are 
out of date (created 2020-06-19)
https://issues.apache.org/jira/browse/BEAM-10244: Populate requirements 
cache fails on poetry-based packages (created 2020-06-11)
https://issues.apache.org/jira/browse/BEAM-10100: FileIO writeDynamic with 
AvroIO.sink not writing all data (created 2020-05-27)

P0 (outage) report

2021-08-10 Thread Beam Jira Bot
This is your daily summary of Beam's current outages. See 
https://beam.apache.org/contribute/jira-priorities/#p0-outage for the meaning 
and expectations around P0 issues.

BEAM-12723: beam_PreCommit_Java_Examples_Dataflow_Phrase failing due to key 
negotiation error (https://issues.apache.org/jira/browse/BEAM-12723)
BEAM-12713: beam_PostCommit_NightlySnapshot runMobileGamingJavaDataflow 
failing (https://issues.apache.org/jira/browse/BEAM-12713)
BEAM-12676: StreamingWordCountIT is failing for Python PreCommit  
(https://issues.apache.org/jira/browse/BEAM-12676)


Re: Protobuf schema provider row functions break on CamelCased field names

2021-08-10 Thread Chris Hinds
I created an issue for this: https://issues.apache.org/jira/browse/BEAM-12736

I also took a stab at a fix. Would you accept a pull request? Or, I'd be happy 
to discuss.

Cheers,
Chris.


On 9 Aug 2021, at 21:02, Chris Hinds <chris.hi...@bdi.ox.ac.uk> wrote:

Haha, it probably shouldn’t!

I’ll have a look.

C.

On 9 Aug 2021, at 19:43, Reuven Lax <re...@google.com> wrote:


I didn't even realize that the proto compiler allowed camel-case field names! I 
think the only option would be for someone to add support for camel-case proto 
field names in Beam.

On Mon, Aug 9, 2021 at 10:57 AM Chris Hinds <chris.hi...@bdi.ox.ac.uk> wrote:
Hi,

I get an IllegalArgumentException when I call a row function against a proto 
instance.

SerializableFunction<MyDataModel.ProtoPayload, Row> myRowFunction =
    new ProtoMessageSchema().toRowFunction(new TypeDescriptor<MyDataModel.ProtoPayload>() {});
MyDataModel.ProtoPayload payload = …
Row row = myRowFunction.apply(payload);

It looks like there is some field name mapping down in 
ProtoByteBuddyUtils.getProtoGetter() and protoGetterName(...) which assumes 
proto field names are snake_case, when it tries to map row schema names into 
proto getter names.

I appreciate proto best practice is to have snake_case field names. Sadly, my 
proto field names are lowerCamel, and the above lookup fails.

Short of renaming every Proto definition, is there anything I can do?

Cheers,
Chris.



Re: [Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
Thanks for your responses Luke. One point I have confusion over:

* Modify the sink implementation to do what you want with the bad data and
> update the pipeline.
>

 I modified the sink implementation to ignore the specific error that was
the problem and updated the pipeline. The update succeeded.  However, the
problem persisted.  How might that happen?  Is there caching involved?
Checkpointing? I changed the very last method called in the pipeline in
order to ensure the validation would apply, but the problem persisted.

And in the case where one is using a Sink which is a built-in IO module of
Beam, modification of the Sink may not be immediately feasible. Is the only
recourse in that case to drain the job and start a new one?

On Tue, Aug 10, 2021 at 12:54 PM Luke Cwik  wrote:

>
>
> On Tue, Aug 10, 2021 at 8:54 AM Evan Galpin  wrote:
>
>> Hi all,
>>
>> I recently had an experience where a streaming pipeline became "clogged"
>> due to invalid data reaching the final step in my pipeline such that the
>> data was causing a non-transient error when writing to my Sink.  Since the
>> job is a streaming job, the element (bundle) was continuously retrying.
>>
>> What options are there for getting out of this state when it occurs?
>>
>
> * Modify the sink implementation to do what you want with the bad data and
> update the pipeline.
> * Cancel the pipeline and update the implementation to handle the bad
> records and rerun from last known good position reprocessing whatever was
> necessary.
>
> I attempted to add validation and update the streaming job to remove the
>> bad entity; though the update was successful, I believe the bad entity was
>> already checkpointed (?) further downstream in the pipeline. What then?
>>
> And for something like a database schema and evolving it over time, what
>> is the typical solution?
>>
>
> * Pipeline update containing newly updated schema before data with new
> schema starts rolling in.
> * Use a format and encoding that is agnostic to changes with a source/sink
> that can consume this agnostic format. See this thread[1] and others like
> it in the user and dev mailing lists.
>
>
>> - Should pipelines mirror a DB schema and do validation of all data types
>> in the pipeline?
>>
>
> Perform validation at critical points in the pipeline like data ingress
> and egress. Insertion of the data into the DB failing via a dead letter
> queue works for the cases that are loud and throw exceptions but for the
> cases where they are inserted successfully but are still invalid from a
> business logic standpoint won't be caught without validation.
>
>
>> - Should all sinks implement a way to remove non-transient failures from
>> retrying and output them via PCollectionTuple (such as with BigQuery failed
>> inserts)?
>>
>
> Yes, dead letter queues (DLQs) are quite a good solution for this since it
> provides a lot of flexibility and allows for a process to fix it up
> (typically a manual process).
>
> 1:
> https://lists.apache.org/thread.html/r4b31c8b76fa81dcb130397077b981ab6429f2999b9d864c815214c5a%40%3Cuser.beam.apache.org%3E
>


Contributor permission for beam Jira tickets

2021-08-10 Thread Chris Hinds
Hello,

I’ve been dipping in and out of Beam as a user for a while. I came across an 
issue and I’d like to assign it to myself (username: chrishinds). 

Thanks!
Chris.

[Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
Hi all,

I recently had an experience where a streaming pipeline became "clogged"
due to invalid data reaching the final step in my pipeline such that the
data was causing a non-transient error when writing to my Sink.  Since the
job is a streaming job, the element (bundle) was continuously retrying.

What options are there for getting out of this state when it occurs? I
attempted to add validation and update the streaming job to remove the bad
entity; though the update was successful, I believe the bad entity was
already checkpointed (?) further downstream in the pipeline. What then?

And for something like a database schema and evolving it over time, what is
the typical solution?

- Should pipelines mirror a DB schema and do validation of all data types
in the pipeline?
- Should all sinks implement a way to remove non-transient failures from
retrying and output them via PCollectionTuple (such as with BigQuery failed
inserts)?


Re: Help needed with migration of GitHub Action Runners from GitHub to GKE.

2021-08-10 Thread Fernando Morales Martinez
Yeah, it totally works for me!

Thanks again

- Fer

On Mon, Aug 9, 2021 at 5:20 PM Pablo Estrada  wrote:

> Hi Fernando,
> by any chance, could we make this 30 minutes earlier? A conflict came up
> that's a little hard to move.
>
> On Mon, Aug 9, 2021 at 8:03 AM Fernando Morales Martinez <
> fernando.mora...@wizeline.com> wrote:
>
>> Hi Jarek!
>> Thanks a lot for the insight and resources you shared with us. I will add
>> that to the document I'm drafting containing the changes needed for this
>> effort to work.
>>
>> Pablo,
>> I will go through these documents today, and let's say we have an initial
>> call on Wednesday?
>>
>> Sharing the call info here in case someone else would like to join.
>>
>> Wednesday, August 11 · 2:00 – 3:00pm EST
>> Google Meet joining info
>> Video call link: https://meet.google.com/xsd-azwb-rxz
>> More phone numbers: https://tel.meet/xsd-azwb-rxz?pin=7716629663982
>>
>> Thanks again, Pablo and Jarek!
>>
>> - Fer
>>
>>
>> On Thu, Aug 5, 2021 at 11:37 AM Jarek Potiuk  wrote:
>>
>>> I'd love to help, but I am on vacation next week, just one word of
>>> warning. If you want to run GitHub Runner on your own infrastructure, that
>>> might introduce several security risks.
>>>
>>> Basically anyone who makes a PR to your repo can compromise your
>>> runners. The dangers of compromising runners are explained here:
>>> https://docs.github.com/en/actions/learn-github-actions/security-hardening-for-github-actions#potential-impact-of-a-compromised-runner
>>> And GitHub still very strongly discourages people from using
>>> self-hosted runners for public repositories:
>>>
>>>
>>> https://docs.github.com/en/actions/learn-github-actions/security-hardening-for-github-actions#hardening-for-self-hosted-runners
>>>
>>> Self-hosted runners on GitHub do not have guarantees around running in
>>> ephemeral clean virtual machines, and can be persistently compromised by
>>> untrusted code in a workflow.
>>> As a result, self-hosted runners should almost never be used for public
>>> repositories on GitHub, because any user can open pull requests against the
>>> repository and compromise the environment.
>>>
>>> We've worked around it in Airflow: we run our self-hosted runners on
>>> Amazon, but with a patched version of the runner:
>>> https://github.com/ashb/runner. We needed to patch it because the
>>> original runner cannot reject builds started by "unknown" users. The
>>> fork is automatically rebased and the patch re-applied whenever GitHub
>>> releases a new runner (old versions stop working a few days after a new
>>> version is released). We keep a list of allowed people ("committers")
>>> who can use such runners.
>>> We also maintain a list of maintainers in our workflow who are "routed"
>>> to self-hosted runners:
>>> https://github.com/apache/airflow/blob/1bd3a5c68c88cf3840073d6276460a108f864187/.github/workflows/ci.yml#L86
>>>
>>> These are a bit hacky (until GitHub implements proper support for
>>> this), but if you want to keep people from using your runners to mine
>>> bitcoin, steal secrets, or potentially modify your repository without
>>> your knowledge, similar steps should be taken.
>>>
>>> J.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 5, 2021 at 7:17 PM Pablo Estrada  wrote:
>>>
 That works for me! Maybe create a calendar invite and share it in this
 thread.
 Best
 -P.

 On Thu, Aug 5, 2021, 8:36 AM Fernando Morales Martinez <
 fernando.mora...@wizeline.com> wrote:

> Thanks for the help, Pablo!
> I'm also available most of Monday, Tuesday and Wednesday; how about we
> set the meeting for Tuesday August 10th 2pm EST? In case someone 
> interested
> can't make it, we can adjust accordingly.
> Thanks again!
> -Fer
>
> On Wed, Aug 4, 2021 at 5:55 PM Pablo Estrada 
> wrote:
>
>> Hello Fernando!
>> The people that built the GKE cluster infrastructure have not been
>> involved with Beam for a while.
>> I think you can set a time that is convenient for you, and invite
>> others to participate - and we'll all figure it out together.
>> I'm available most of Monday, Tuesday, Wednesday of next week. I'll
>> be happy to jump on a call and be confused with you (and maybe others 
>> will
>> too).
>> Best
>> -P.
>>
>> On Tue, Aug 3, 2021 at 11:20 AM Fernando Morales Martinez <
>> fernando.mora...@wizeline.com> wrote:
>>
>>> Hi everyone,
>>> As part of the work done to migrate the GitHub Actions runners over
>>> to GKE, the following (non-exhaustive) changes were performed:
>>>
>>> - added a new secret to the apache-beam-testing project (this is the
>>> PAT needed by the docker image to connect to GitHub)
>>> - added a new docker image (which will execute the test flows still
>>> being run by GitHub)
>>>
>>> and other changes which you can take a look at here:
>>>