Re: [PROPOSAL] Re-enable checkerframework by default

2022-10-21 Thread Kenneth Knowles
https://github.com/apache/beam/pull/23792

On Fri, Oct 21, 2022 at 9:49 AM Reuven Lax via dev 
wrote:

> +1
>
> This happens to me regularly. It fails on Jenkins but succeeds on my
> machine, and it's hard to figure out why (since all you see on Jenkins is a
> compile error). Then I'm always trying to remember how to enable it
> locally. IMO development would be faster if this was enabled locally.
> Anyone who doesn't like it can always disable it for their local compiles.
>
> Reuven
>
> On Fri, Oct 21, 2022 at 8:38 AM Alexey Romanenko 
> wrote:
>
>> +1 to making it "on" by default and mentioning that in the Contribution Guide.
>>
>> I recall for one PR that it took me some time to realise why it was
>> failing on Jenkins and not locally because of this different behaviour.
>>
>> —
>> Alexey
>>
>> > On 20 Oct 2022, at 00:51, Kenneth Knowles  wrote:
>> >
>> > Hi all,
>> >
>> > Some time ago we turned off checker framework locally by default, and
>> only turn it on with `-PenableCheckerFramework` and also on Jenkins.
>> >
>> > My opinion is that this causes more headache than it solves, by
>> delaying finding out about errors. The increased compilation time of
>> checkerframework is real. But during iteration almost every step of a
>> compile is cached so it only matters specifically for :sdks:java:core. My
>> take is that anyone editing that is probably experienced enough with Beam
>> to know they can turn it off. So I propose we turn it on by default, with
>> the option to disable it.
>> >
>> > Kenn
>>
>>


Re: Questions on primitive transforms hierarchy

2022-10-21 Thread Kenneth Knowles
On Fri, Oct 21, 2022 at 5:24 AM Jan Lukavský  wrote:

> Hi,
>
> I have some missing pieces in my understanding of the set of Beam's
> primitive transforms, which I'd like to fill. First a quick recap of what I
> think is the current state. We have (basically) the following primitive
> transforms:
>
>  - DoFn (stateless, stateful, splittable)
>
>  - Window
>
>  - Impulse
>
>  - GroupByKey
>
>  - Combine
>

Not a primitive, just a well-defined transform that runners can execute in
special ways.


>
>
>  - Flatten (pCollections)
>

The rest, yes.



> Inside runners, we most often transform GBK into ReduceFn
> (ReduceFnRunner), which does the actual logic for both GBK and stateful
> DoFn.
>

ReduceFnRunner is for windowing / triggers and has a special feature to use a
CombineFn while doing it. It has nothing to do with stateful DoFn.



> I'll compare this to the set of transforms we used to use in Euphoria
> (currently java SDK extension):
>
>  - FlatMap ~~ stateless DoFn
>
>  - Union ~~ Flatten
>
>  - ReduceStateByKey ~~ stateful DoFn, GBK, Combine, Window
>

Stateful DoFn does not require an associative or commutative operation, while
reduce/combine does. Windowing is really just a secondary key for
GBK/Combine that allows completion of unbounded aggregations but has no
computation associated with it.
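
To make the "secondary key" remark concrete, here is a minimal sketch in
plain Python. It does not use the Beam API; `fixed_window` and
`group_by_key_and_window` are hypothetical helpers, and the window size and
sample events are invented for illustration.

```python
from collections import defaultdict


def fixed_window(timestamp, size):
    """Return the half-open [start, end) fixed window containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def group_by_key_and_window(elements, size):
    """Group (key, value, timestamp) triples by the composite (key, window)."""
    groups = defaultdict(list)
    for key, value, timestamp in elements:
        # The window only extends the grouping key; it never touches the value.
        groups[(key, fixed_window(timestamp, size))].append(value)
    return dict(groups)


# Hypothetical sample data: (key, value, event timestamp).
events = [("a", 1, 3), ("a", 2, 7), ("a", 5, 12), ("b", 4, 4)]
grouped = group_by_key_and_window(events, size=10)
sums = {key_and_window: sum(values) for key_and_window, values in grouped.items()}
```

Because every group is bounded by its window, an aggregation over an
unbounded input can be emitted once the watermark passes the window end.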



>  - (missing Impulse)
>

Then you must have some primitive sources with splitting?


>  - (missing splittable DoFn)
>

Kind of the same question - SDF is the one and only primitive that creates
parallelism.
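
As a rough illustration of how one Impulse element plus a splittable step
can fan out into parallel work, here is a plain-Python sketch that splits
an offset range into sub-ranges. `split_range` and `process_restriction`
are hypothetical names, not Beam's restriction tracker API, and the chunk
size is arbitrary.

```python
def split_range(start, stop, desired_chunk):
    """Split the half-open range [start, stop) into chunks of at most desired_chunk."""
    if desired_chunk <= 0:
        raise ValueError("desired_chunk must be positive")
    return [(lo, min(lo + desired_chunk, stop))
            for lo in range(start, stop, desired_chunk)]


def process_restriction(data, restriction):
    """Process one sub-range independently of all the others."""
    lo, hi = restriction
    return data[lo:hi]


data = list(range(10))
impulse = [b""]  # a single seed element, standing in for Impulse

# One seed element expands into several restrictions, each of which could
# be handed to a different worker.
restrictions = [r for _ in impulse for r in split_range(0, len(data), 4)]
outputs = [x for r in restrictions for x in process_restriction(data, r)]
```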

> The ReduceStateByKey is a transform that is a "combinable stateful DoFn" -
> i.e. the state might be created pre-shuffle, on trigger the state is
> shuffled and then merged. In Beam we already have CombiningState and
> MergingState facility (sort of), which is what is needed, we just do not
> have the ability to shuffle the partial states and then combine them. This
> also relates to the inability to run stateful DoFn for merging windowFns,
> because that is needed there as well. Is this something that is
> fundamentally impossible to define for all runners? What is worth noting is
> that building, shuffling and merging the state before shuffle requires a
> compatible trigger (purely based on watermark), otherwise the transform
> falls back to "classical DoFn".
>

Stateful DoFn for merging windows can be defined. You could require all
state to be mergeable and then it is automatic. Or you could have an
"onMerge" callback. These should both be fine. The automatic version is
less likely to have nonsensical semantics, but allowing the callback to do
"whatever it wants" whether the result is good or not is more consistent
with the design of stateful DoFn.
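
A minimal sketch of the two options, in plain Python rather than the Beam
API: when windows merge, the per-window state has to be merged too.
`merge_sessions` is a hypothetical helper. Its `merge_state` argument
stands in either for a merge operation built into a mergeable state type
(the automatic version) or for a user-supplied "onMerge" callback. The
sketch assumes half-open windows that merge only when they strictly
overlap.

```python
import operator


def merge_sessions(windows_with_state, merge_state):
    """Merge overlapping [start, end) windows for one key, combining their state.

    windows_with_state: list of ((start, end), state) pairs.
    merge_state: function combining the states of two merged windows.
    """
    merged = []
    for (start, end), state in sorted(windows_with_state, key=lambda ws: ws[0]):
        if merged and start < merged[-1][0][1]:
            # Overlaps the previous window: widen it and merge the state.
            (m_start, m_end), m_state = merged[-1]
            merged[-1] = ((m_start, max(m_end, end)), merge_state(m_state, state))
        else:
            merged.append(((start, end), state))
    return merged


# Hypothetical per-key state: a running sum held in each session window.
per_key = [((0, 10), 1), ((5, 15), 2), ((30, 40), 4)]
result = merge_sessions(per_key, operator.add)
```

With a state type that knows how to merge itself, the caller never writes
`merge_state`; with a callback, it can be any function, sensible or not.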

Whether and where a shuffle takes place may vary. Start with the maths.

Kenn


> Bottom line: I'm thinking of proposing to drop Euphoria extension, because
> it has essentially no users and actually no maintainers, but I have a
> feeling there is a value in the set of operators that could be transferred
> to Beam core, maybe. I'm pretty sure it would bring value to users to have
> access to a "combining stateful DoFn" primitive (even better would be
> "combining splittable DoFn").
>
> Looking forward to any comments on this.
>
>  Jan
>
>
>


Re: FOSDEM 2023 is back as in person event

2022-10-21 Thread Ismaël Mejía
Hi Aizhamal,

You might be interested in this thread, where ASF people are also
discussing FOSDEM participation:
https://lists.apache.org/thread/kv4fhldmc9mo6v5lwtkwqtwg97l64lx1

It seems the call for devrooms is closed, so it may be too late for
Beam, but we have had Beam talks in the past as part of the Big Data
track, so it may be worth participating there.

Best,
Ismaël

On Mon, Oct 17, 2022 at 9:06 PM Aizhamal Nurmamat kyzy
 wrote:
>
> Hi Beam community!
>
> FOSDEM 2023 is back as an in-person event! I have
> heard only great things about the event where thousands of developers get 
> together to talk all about open source!
>
> Is anyone from the Beam community planning to attend? The event takes place 
> in Brussels on February 4 & 5, 2023. I believe it is also free to attend but 
> don't quote me on this.
>
> As an open source project we can also have
> - a stand for free https://fosdem.org/2023/news/2022-09-26-stands-cfp/
> - a Devroom https://fosdem.org/2023/news/2022-09-29-call_for_devrooms/
>
> Anyone interested?
>


Re: [DISCUSS] Jenkins -> GitHub Actions ?

2022-10-21 Thread Ismaël Mejía
+1 GitHub Actions are more intuitive and easier to modify and test for everyone.
Beam also wins because it means one less system to maintain.

Regards,
Ismaël

On Wed, Oct 19, 2022 at 5:50 PM Danny McCormick via dev
 wrote:
>
> Thanks for kicking this conversation off. I'm +1 on migrating, but only once 
> we've found a specific replacement for easy observability (which workflows 
> have been failing lately, and how often) and trigger phrases (for retries and 
> workflows that aren't automatically kicked off but should be run for extra 
> validation, e.g. postcommits). Until we have viable replacements, I don't 
> think we should make the move. Publishing nightly snapshots is eventually 
> also a must to fully migrate, but probably doesn't need to block us from 
> making progress here.
>
> With those caveats, the reason that I'm +1 on moving is that our Jenkins 
> reliability has been rough. Since I joined the project in January, I can 
> think of 3 different incidents that significantly harmed our ability to do 
> work.
>
> 1. Jenkins triggers cause multi-day outage - this led to a multi-day code 
> freeze, and we lost our trigger functionality for days afterwards. 
> Investigating/restoring our state ate up a pretty full week for me.
> 2. Jenkins plugin cause multi-day outage - this led to multiple days of 
> Jenkins downtime before eventually being resolved by Infra.
> 3. Cert issues cause many workers to go down - I don't have a thread for this 
> because I handled most of the investigation the day of, but many of our 
> workers went down for around a day and nobody noticed until queue time 
> reached 6+ hours for each workflow.
>
> There may be others that I'm overlooking.
>
> GitHub Actions isn't a magic bullet to fix these problems, but it minimizes 
> the amount of infra that we're maintaining ourselves, increases the isolation 
> between workflows (catastrophic failure is less likely), has uptime 
> guarantees, and is more likely to receive investment going forward (we're 
> likely to get increasing benefits over time for free). We've also done a lot 
> of exploration in this area already, so we're not starting from scratch.
>
> Thanks,
> Danny
>
> On Wed, Oct 19, 2022 at 11:32 AM Kenneth Knowles  wrote:
>>
>> Hi all,
>>
>> As you probably noticed, there's a lot of work going on around adding more 
>> GitHub Actions workflows.
>>
>> Can we fully migrate to GitHub Actions? Similar to our GitHub Issues 
>> migration (but less user-facing) it would bring us on to "default" 
>> infrastructure that more people understand and is maintained by GitHub.
>>
>> So far we have hit some serious roadblocks. It isn't just a simple 
>> migration. We have to weigh doing the work to get there.
>>
>> I started a document with a table of the things we get from Jenkins that we 
>> need to be sure to have for GitHub Actions before we could think about 
>> migrating:
>>
>> https://s.apache.org/beam-jenkins-to-gha
>>
>> Can you please help me by adding things that we get from Jenkins, and if you 
>> know how to get them from GitHub Actions add that too.
>>
>> Thanks!
>>
>> Kenn
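
One way to approach the observability requirement raised in this thread
(which workflows have been failing lately, and how often) is to aggregate
run records per workflow. The sketch below is plain Python over in-memory
records. The field names `name`, `status` and `conclusion` are assumed to
match what GitHub's workflow-run listing returns; the workflow names and
sample data are invented.

```python
from collections import Counter


def failure_rates(runs):
    """Return {workflow name: fraction of completed runs that failed}."""
    totals, failures = Counter(), Counter()
    for run in runs:
        if run.get("status") != "completed":
            continue  # skip queued or in-progress runs
        totals[run["name"]] += 1
        if run.get("conclusion") == "failure":
            failures[run["name"]] += 1
    return {name: failures[name] / totals[name] for name in totals}


# Hypothetical sample records, not real Beam workflow data.
sample_runs = [
    {"name": "precommit-java", "status": "completed", "conclusion": "failure"},
    {"name": "precommit-java", "status": "completed", "conclusion": "success"},
    {"name": "precommit-python", "status": "completed", "conclusion": "success"},
    {"name": "precommit-python", "status": "in_progress", "conclusion": None},
]
rates = failure_rates(sample_runs)
```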


Re: [PROPOSAL] Re-enable checkerframework by default

2022-10-21 Thread Reuven Lax via dev
+1

This happens to me regularly. It fails on Jenkins but succeeds on my
machine, and it's hard to figure out why (since all you see on Jenkins is a
compile error). Then I'm always trying to remember how to enable it
locally. IMO development would be faster if this was enabled locally.
Anyone who doesn't like it can always disable it for their local compiles.

Reuven

On Fri, Oct 21, 2022 at 8:38 AM Alexey Romanenko 
wrote:

> +1 to making it "on" by default and mentioning that in the Contribution Guide.
>
> I recall for one PR that it took me some time to realise why it was
> failing on Jenkins and not locally because of this different behaviour.
>
> —
> Alexey
>
> > On 20 Oct 2022, at 00:51, Kenneth Knowles  wrote:
> >
> > Hi all,
> >
> > Some time ago we turned off checker framework locally by default, and
> only turn it on with `-PenableCheckerFramework` and also on Jenkins.
> >
> > My opinion is that this causes more headache than it solves, by delaying
> finding out about errors. The increased compilation time of
> checkerframework is real. But during iteration almost every step of a
> compile is cached so it only matters specifically for :sdks:java:core. My
> take is that anyone editing that is probably experienced enough with Beam
> to know they can turn it off. So I propose we turn it on by default, with
> the option to disable it.
> >
> > Kenn
>
>


Re: [PROPOSAL] Re-enable checkerframework by default

2022-10-21 Thread Alexey Romanenko
+1 to making it "on" by default and mentioning that in the Contribution Guide.

I recall for one PR that it took me some time to realise why it was failing on 
Jenkins and not locally because of this different behaviour. 

—
Alexey

> On 20 Oct 2022, at 00:51, Kenneth Knowles  wrote:
> 
> Hi all,
> 
> Some time ago we turned off checker framework locally by default, and only 
> turn it on with `-PenableCheckerFramework` and also on Jenkins.
> 
> My opinion is that this causes more headache than it solves, by delaying 
> finding out about errors. The increased compilation time of checkerframework 
> is real. But during iteration almost every step of a compile is cached so it 
> only matters specifically for :sdks:java:core. My take is that anyone editing 
> that is probably experienced enough with Beam to know they can turn it off. 
> So I propose we turn it on by default, with the option to disable it.
> 
> Kenn



Questions on primitive transforms hierarchy

2022-10-21 Thread Jan Lukavský

Hi,

I have some missing pieces in my understanding of the set of Beam's 
primitive transforms, which I'd like to fill. First a quick recap of 
what I think is the current state. We have (basically) the following 
primitive transforms:


 - DoFn (stateless, stateful, splittable)

 - Window

 - Impulse

 - GroupByKey

 - Combine

 - Flatten (pCollections)


Inside runners, we most often transform GBK into ReduceFn 
(ReduceFnRunner), which does the actual logic for both GBK and stateful 
DoFn.


I'll compare this to the set of transforms we used to use in Euphoria 
(currently java SDK extension):


 - FlatMap ~~ stateless DoFn

 - Union ~~ Flatten

 - ReduceStateByKey ~~ stateful DoFn, GBK, Combine, Window

 - (missing Impulse)

 - (missing splittable DoFn)


The ReduceStateByKey is a transform that is a "combinable stateful DoFn" 
- i.e. the state might be created pre-shuffle, on trigger the state is 
shuffled and then merged. In Beam we already have CombiningState and 
MergingState facility (sort of), which is what is needed, we just do not 
have the ability to shuffle the partial states and then combine them. 
This also relates to the inability to run stateful DoFn for merging 
windowFns, because that is needed there as well. Is this something that 
is fundamentally impossible to define for all runners? What is worth 
noting is that building, shuffling and merging the state before shuffle 
requires a compatible trigger (purely based on watermark), otherwise the 
transform falls back to "classical DoFn".
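
The "build state pre-shuffle, shuffle, then merge" idea can be sketched in
plain Python as follows. This is an illustration under assumptions, not
Euphoria's or Beam's implementation: `MeanFn` is a standalone class whose
method names merely mirror the CombineFn shape, and `pre_shuffle` and
`shuffle_and_merge` are hypothetical helpers that simulate partitions with
lists.

```python
from collections import defaultdict


class MeanFn:
    """CombineFn-shaped mean; the accumulator is a (sum, count) pair."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, acc, value):
        return (acc[0] + value, acc[1] + 1)

    def merge_accumulators(self, accs):
        total, count = 0.0, 0
        for s, c in accs:
            total += s
            count += c
        return (total, count)

    def extract_output(self, acc):
        return acc[0] / acc[1] if acc[1] else float("nan")


def pre_shuffle(partition, fn):
    """Build one partial accumulator per key inside a single partition."""
    partial = {}
    for key, value in partition:
        acc = partial[key] if key in partial else fn.create_accumulator()
        partial[key] = fn.add_input(acc, value)
    return partial


def shuffle_and_merge(partials, fn):
    """Bring partial accumulators for each key together and merge them."""
    by_key = defaultdict(list)
    for partial in partials:
        for key, acc in partial.items():
            by_key[key].append(acc)
    return {key: fn.extract_output(fn.merge_accumulators(accs))
            for key, accs in by_key.items()}


# Two hypothetical upstream partitions of (key, value) pairs.
partitions = [[("a", 1), ("a", 3), ("b", 10)], [("a", 5), ("b", 20)]]
fn = MeanFn()
result = shuffle_and_merge([pre_shuffle(p, fn) for p in partitions], fn)
```

Only the small accumulators cross the shuffle, which is why the merge
operation has to be associative and commutative.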


Bottom line: I'm thinking of proposing to drop Euphoria extension, 
because it has essentially no users and actually no maintainers, but I 
have a feeling there is a value in the set of operators that could be 
transferred to Beam core, maybe. I'm pretty sure it would bring value to 
users to have access to a "combining stateful DoFn" primitive (even 
better would be "combining splittable DoFn").


Looking forward to any comments on this.

 Jan



Beam High Priority Issue Report (43)

2022-10-21 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P0 Issues:

https://github.com/apache/beam/issues/23747 [Bug]: After JDBCIO read 
withRowOutput(), the VARCHAR/TEXT -> LOGICAL_TYPE and not compatible with 
SqlTypeName


Unassigned P1 Issues:

https://github.com/apache/beam/issues/23768 [Bug]: 
beam_PostCommit_Py_VR_Dataflow is failing in 
`translations_test.TranslationsTest.test_run_packable_combine_{globally,limit}`
https://github.com/apache/beam/issues/23745 [Bug]: Samza 
AsyncDoFnRunnerTest.testSimplePipeline is flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23693 [Bug]: apache_beam.io.kinesis 
module READ_DATA_URN mismatch
https://github.com/apache/beam/issues/22321 
PortableRunnerTestWithExternalEnv.test_pardo_large_input is regularly failing 
on jenkins
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21561 
ExternalPythonTransformTest.trivialPythonTransform flaky
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21463 NPE in Flink Portable 
ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21364 Flink load tests fail: 
NoClassDefFoundError: MessageBodyReader
https://github.com/apache/beam/issues/21333 Flink testParDoRequiresStableInput 
flaky
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21123 Multiple jobs running on Flink 
session cluster reuse the persistent Python environment.
https://github.com/apache/beam/issues/21113 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
https://github.com/apache/beam/issues/20977 SamzaStoreStateInternalsTest is 
flaky
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics is flaky
https://github.com/apache/beam/issues/20975 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM 
on Flink
https://github.com/apache/beam/issues/20655 Flink PortableValidatesRunner test 
failure: GroupByKeyTest$BasicTests.testLargeKeys10MB
https://github.com/apache/beam/issues/20269 Flink postcommits failing 
testFlattenWithDifferentInputAndOutputCoders2
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19241 Python Dataflow integration tests 
should export the pipeline Job ID and console output to Jenkins Test Result 
section


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/23489 [Bug]: add DebeziumIO to the 
connectors page
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of 
`DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22891 [Bug]: 
beam_PostCommit_XVR_PythonUsingJavaDataflow is flaky
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/22011 [Bug]: 
org.apache.beam.sdk.io.aws2.kinesis.KinesisIOWriteTest.testWriteFailure flaky
https://github.com/apache/beam/issues/21893 [Bug]: BigQuery Storage Write API 
implementation does not support table partitioning
https://github.com/apache/beam/issues/21711 Python Streaming job failing to 
drain with BigQueryIO write errors
https://github.com/apache/beam/issues/21709 
beam_PostCommit_Java_ValidatesRunner_Samza Failing
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21707 GroupByKeyTest BasicTests 
testLargeKeys100MB flake (on ULR)