Re: [spark structured streaming runner] merge to master?

2019-10-29 Thread Jean-Baptiste Onofré
I agree, it would be the easiest way and would allow users to switch easily as well, using a single artifact. Regards JB On 29/10/2019 23:54, Kenneth Knowles wrote: Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag. Kenn
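For illustration, the single-artifact, flag-toggled setup might look like this from the user's side (a minimal sketch; the runner class names match Beam's two Spark runners at the time of this thread):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerToggle {
      public static void main(String[] args) {
        // With both runners in one artifact, the choice is a runtime flag:
        //   --runner=SparkRunner                     (existing RDD-based runner)
        //   --runner=SparkStructuredStreamingRunner  (new runner)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here; the flag alone selects the runner.
        p.run().waitUntilFinish();
      }
    }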

Re: Python SDK timestamp precision

2019-10-29 Thread Luke Cwik
I would also suggest using Java's Instant, since it will be compatible with many more date/time libraries without forcing users through an artificial millis/nanos conversion layer to Java's Instant. On Tue, Oct 29, 2019 at 5:06 PM Robert Bradshaw wrote: > On Tue, Oct 29, 2019
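To make the conversion-layer point concrete, a small sketch (plain java.time, no Beam code) of the precision lost when a nanosecond Instant is forced through a millis-only representation:

    import java.time.Instant;

    public class PrecisionDemo {
      public static void main(String[] args) {
        Instant nanos = Instant.ofEpochSecond(1_572_393_600L, 123_456_789L);

        // A millis-only API forces a lossy round trip:
        long millis = nanos.toEpochMilli();           // truncated to ...123 ms
        Instant roundTripped = Instant.ofEpochMilli(millis);

        System.out.println(nanos);        // 2019-10-30T00:00:00.123456789Z
        System.out.println(roundTripped); // 2019-10-30T00:00:00.123Z
      }
    }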

Re: aggregating over triggered results

2019-10-29 Thread Aaron Dixon
Thank you, Luke and Robert. Sorry for hitting dev@; I criss-crossed and meant to hit user@, but as we're here, could you clarify your two points? 1) I am under the impression that the 4,000 sliding windows approach (30 days every 10m) will re-evaluate my combine aggregation every 10m

Re: Python SDK timestamp precision

2019-10-29 Thread Robert Bradshaw
On Tue, Oct 29, 2019 at 4:20 PM Kenneth Knowles wrote: > > Point (1) is compelling. Solutions to the "minus epsilon" seem a bit complex. > On the other hand, an opaque and abstract Timestamp type (in each SDK) going > forward seems like a Pretty Good Idea (tm). Would you really have to go >

Re: Python SDK timestamp precision

2019-10-29 Thread Kenneth Knowles
(acknowledging that ADT is a language-level concept and this shows up as multiple implementations with URNs or some such in the portability layer, so we can version and name the choices) On Tue, Oct 29, 2019, 16:20 Kenneth Knowles wrote: > Point (1) is compelling. Solutions to the "minus

Re: Python SDK timestamp precision

2019-10-29 Thread Kenneth Knowles
Point (1) is compelling. Solutions to the "minus epsilon" seem a bit complex. On the other hand, an opaque and abstract Timestamp type (in each SDK) going forward seems like a Pretty Good Idea (tm). Would you really have to go floating point? Could you just have a distinguished representation for
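Assuming the distinguished value in question is the end of a window (the usual source of the "minus epsilon" issue), a sketch of the opaque-timestamp idea (all names hypothetical, not Beam API): an end-of-window flag that sorts just before the window-end instant, with no subtraction at any precision.

    import java.time.Instant;
    import java.util.Objects;

    final class OpaqueTimestamp implements Comparable<OpaqueTimestamp> {
      private final Instant instant;
      private final boolean endOfWindow; // the distinguished representation

      private OpaqueTimestamp(Instant instant, boolean endOfWindow) {
        this.instant = Objects.requireNonNull(instant);
        this.endOfWindow = endOfWindow;
      }

      static OpaqueTimestamp of(Instant t) { return new OpaqueTimestamp(t, false); }

      static OpaqueTimestamp endOfWindow(Instant windowEnd) {
        return new OpaqueTimestamp(windowEnd, true);
      }

      @Override
      public int compareTo(OpaqueTimestamp o) {
        int c = instant.compareTo(o.instant);
        if (c != 0) return c;
        // At the same instant, end-of-window sorts first: "minus epsilon"
        // without any arithmetic on the timestamp itself.
        return Boolean.compare(!endOfWindow, !o.endOfWindow);
      }
    }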

Re: [spark structured streaming runner] merge to master?

2019-10-29 Thread Kenneth Knowles
Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag. Kenn On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko wrote: > Hmm, I don’t think that jar size should play a big role comparing to the > whole size of shaded jar of

Re: Quota issues again

2019-10-29 Thread Kenneth Knowles
Post-commit runs all precommits. The builder for the Jenkins jobs creates separate jobs with suffixes: * _Commit (for when a commit is pushed to a PR) * _Phrase (for when someone asks to run it) * _Cron (run as a post-commit against master) This way, the different jobs have independent

Re: aggregating over triggered results

2019-10-29 Thread Robert Bradshaw
No matter how the problem is structured, computing 30 day aggregations for every 10 minute window requires storing at least 30day/10min = ~4000 sub-aggregations. In Beam, the elements themselves are not stored in every window, only the intermediate aggregates. I second Luke's suggestion to try it

Re: aggregating over triggered results

2019-10-29 Thread Luke Cwik
You should first try the obvious answer of using a sliding window of 30 days every 10 minutes before you try the 60 days every 30 days. Beam has some optimizations which will assign a value to multiple windows and only process that value once, even if it's in many windows. If that doesn't perform
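The "obvious answer" in Beam Java terms, as a minimal sketch (assuming an input PCollection<KV<String, Long>>; a Combine stores only the intermediate aggregate per window, not the elements):

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    class RunningTotals {
      // Each element lands in 30d / 10min = 4320 windows, but only the
      // per-window sum is stored, not the elements themselves.
      static PCollection<KV<String, Long>> apply(PCollection<KV<String, Long>> events) {
        return events
            .apply(Window.<KV<String, Long>>into(
                SlidingWindows.of(Duration.standardDays(30))
                    .every(Duration.standardMinutes(10))))
            .apply(Combine.perKey(Sum.ofLongs()));
      }
    }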

aggregating over triggered results

2019-10-29 Thread Aaron Dixon
Hi, I am new to Beam. I would like to accumulate data over a 30 day period and perform a running aggregation over this data, say every 10 minutes. I could use a sliding window of 30 days every 10 minutes (triggering at end of window) but this seems grossly inefficient (both in terms of # of windows

[discuss] Exposing sdk harness status to runner for better debuggability

2019-10-29 Thread Yichi Zhang
Hi, Beam dev community, After seeing some difficulties in troubleshooting Python pipelines in Dataflow, we came up with an idea to expose SDK harness status to runners through the Fn API for better debuggability. I put up a doc describing the background and proposed approach:

Re: Python Precommit duration pushing 2 hours

2019-10-29 Thread Robert Bradshaw
https://github.com/apache/beam/pull/9925 On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri wrote: > > I don't have the bandwidth right now to tackle this. Feel free to take it. > > On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw wrote: >> >> The Python SDK does as well. These calls are coming from >>

Re: Quota issues again

2019-10-29 Thread Mikhail Gryzykhin
IIRC, currently post-commit doesn't run pre-commits. However, we have precommit_cron jobs that run pre-commits periodically. That sums up to dozens of jobs, which are really hard to monitor. If we split things even further, we definitely need to combine results into something more easily

Re: Python Precommit duration pushing 2 hours

2019-10-29 Thread Udi Meiri
I don't have the bandwidth right now to tackle this. Feel free to take it. On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw wrote: > The Python SDK does as well. These calls are coming from > to_runner_api, is_stateful_dofn, and validate_stateful_dofn which are > invoked once per pipeline or

Re: Quota issues again

2019-10-29 Thread Chad Dombrova
> +1 for splitting pre-commit tests into smaller modules. However, in this > case we need to run all the small tests periodically and have some combined > flag or dashboard for regular monitoring. Otherwise we might not run/check > on a large number of tests. > post-commit seems like the best place

Re: Proposal: Dynamic timer support (BEAM-6857)

2019-10-29 Thread Luke Cwik
Based upon the current description, from the portability perspective we could: Update the timer spec map comment[1] to be: // (Optional) A mapping of local timer families to timer specifications. map<string, TimerSpec> timer_specs = 5; And update the timer coder to have the timer id[2]: // Encodes a timer
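As a companion sketch, one possible user-facing shape for dynamic timers in the Java SDK (TimerMap, @TimerFamily and @OnTimerFamily here match the API that later landed in Beam; at the time of this thread they were still under proposal):

    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.TimerMap;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    class DynamicTimerFn extends DoFn<KV<String, String>, String> {
      // A single timer family; individual timers within it are keyed by string id.
      @TimerFamily("actions")
      private final TimerSpec timerSpec = TimerSpecs.timerMap(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(ProcessContext c, @TimerFamily("actions") TimerMap timers) {
        // The timer id is data-dependent -- the "dynamic" part of the proposal.
        timers.set("expiry-" + c.element().getKey(),
            c.timestamp().plus(Duration.standardMinutes(1)));
      }

      @OnTimerFamily("actions")
      public void onTimer(@TimerId String timerId, OutputReceiver<String> out) {
        out.output("fired: " + timerId);
      }
    }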

Re: Quota issues again

2019-10-29 Thread Mikhail Gryzykhin
+1 for splitting pre-commit tests into smaller modules. However, in this case we need to run all the small tests periodically and have some combined flag or dashboard for regular monitoring. Otherwise we might not run/check on a large number of tests. On Mon, Oct 28, 2019 at 6:39 PM Kenneth Knowles

Re: [spark structured streaming runner] merge to master?

2019-10-29 Thread Alexey Romanenko
Hmm, I don’t think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Even more, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let’s see what others think. > On 29

Re: Is there good way to make Python SDK docs draft accessible?

2019-10-29 Thread Yoshiki Obata
Thank you for the advice, Udi and Ahmet. I'll take a look at the release process. On Tue, Oct 29, 2019 at 3:47, Ahmet Altay wrote: > > Thank you for doing this. It should be possible to run tox as Udi suggested > and create a PR for review purposes similar to the release process (see: >

Re: Python Precommit duration pushing 2 hours

2019-10-29 Thread Kenneth Knowles
Noting for the benefit of the thread archive in case someone goes digging and wonders if this affects other SDKs: the Java SDK memoizes DoFnSignatures and generated DoFnInvoker classes. Kenn On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri wrote: > Re: #9283 slowing down tests, ideas for slowness: >
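The memoization pattern being referenced, as a minimal sketch (Signature and analyze are stand-ins, not Beam names):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class SignatureCache {
      private static final Map<Class<?>, Signature> CACHE = new ConcurrentHashMap<>();

      static Signature signatureFor(Class<?> doFnClass) {
        // Expensive reflective analysis runs at most once per DoFn class.
        return CACHE.computeIfAbsent(doFnClass, SignatureCache::analyze);
      }

      private static Signature analyze(Class<?> doFnClass) {
        return new Signature(doFnClass.getName()); // stand-in for real analysis
      }
    }

    class Signature {
      final String doFnClassName;
      Signature(String doFnClassName) { this.doFnClassName = doFnClassName; }
    }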

Re: Rethinking the Flink Runner modes

2019-10-29 Thread Maximilian Michels
tl;dr: - I see consensus for inferring "http://" in Python to align it with the Java behavior which currently requires leaving out the protocol scheme. Optionally, Java could also accept a scheme which gets removed as required by the Flink Java Rest client. - We won't support "https://" in

Re: Rethinking the Flink Runner modes

2019-10-29 Thread Jan Lukavský
Hi, +1 for empty string being interpreted as [auto] and anything else having explicit notation. One more reason that was not part of this discussion yet: in [1] there was a discussion about LocalEnvironment (that is the one that is responsible for spawning an in-process Flink cluster) not
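Putting the thread's points together, a hypothetical normalization helper (not Beam code; the name and the "[auto]" sentinel follow the behavior discussed above) might look like this:

    class FlinkMasterUrl {
      static String normalize(String master) {
        if (master == null || master.isEmpty()) {
          return "[auto]"; // empty string: let the environment decide
        }
        if (master.startsWith("https://")) {
          throw new IllegalArgumentException("https:// is not supported");
        }
        if (master.startsWith("http://")) {
          // The Flink Java REST client requires host:port without a scheme.
          return master.substring("http://".length());
        }
        return master;
      }
    }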

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Jozef Vilcek
On Tue, Oct 29, 2019 at 10:04 AM Ryan Skraba wrote: > I didn't get a chance to try this out -- it sounds like a bug with the > SparkRunner, if you've tested it with FlinkRunner and it succeeded. > > From your description, it should be reproducible by reading any large > database table with the

Re: Proposal: Dynamic timer support (BEAM-6857)

2019-10-29 Thread Jan Lukavský
Hi Reuven, I didn't propose to restrict the model. The model can (and should) have multiple timers per key, even dynamic ones. The question was whether this can be made efficient by using a single timer (after all, the runner will probably have a single "timer service", so no matter what we expose on the
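A minimal sketch of the single-timer multiplexing Jan describes (all names hypothetical; timer resets and deletions are elided; armPhysicalTimer stands in for whatever single timer the runner's timer service exposes):

    import java.util.HashSet;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    class TimerMultiplexer {
      // deadline (millis) -> logical timer ids due at that deadline
      private final NavigableMap<Long, Set<String>> deadlines = new TreeMap<>();

      void setLogicalTimer(String id, long deadlineMillis) {
        deadlines.computeIfAbsent(deadlineMillis, k -> new HashSet<>()).add(id);
        armPhysicalTimer(deadlines.firstKey()); // only the earliest deadline is armed
      }

      // Called when the single physical timer fires.
      void onPhysicalTimer(long nowMillis) {
        Map.Entry<Long, Set<String>> due;
        while ((due = deadlines.firstEntry()) != null && due.getKey() <= nowMillis) {
          deadlines.pollFirstEntry();
          due.getValue().forEach(this::fireLogicalTimer);
        }
        if (!deadlines.isEmpty()) {
          armPhysicalTimer(deadlines.firstKey());
        }
      }

      private void armPhysicalTimer(long deadlineMillis) { /* runner-specific */ }
      private void fireLogicalTimer(String id) { /* deliver to user code */ }
    }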

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Ryan Skraba
I didn't get a chance to try this out -- it sounds like a bug with the SparkRunner, if you've tested it with FlinkRunner and it succeeded. From your description, it should be reproducible by reading any large database table with the SparkRunner where the entire dataset is greater than the memory

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Jozef Vilcek
I cannot find anything in the docs about the expected behavior of a DoFn emitting an arbitrarily large number of elements in one processElement(). I wonder if the Spark Runner behavior is a bug or just a difference (and a disadvantage in this case) in execution, more towards runner capability matrix differences. Also,
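For concreteness, the pattern under discussion looks like this (a hedged sketch, not JdbcIO's actual code; the JDBC URL and per-element query are assumptions): a single processElement call emits one output per row of an arbitrarily large table, and whether the runner buffers or streams those outputs is exactly the difference being observed.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.beam.sdk.transforms.DoFn;

    class ReadTableFn extends DoFn<String, String> {
      private final String jdbcUrl; // assumed connection string

      ReadTableFn(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // The element is a query; one output is emitted per result row.
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(c.element())) {
          while (rs.next()) {
            // No list is built here; if outputs pile up in memory anyway,
            // that buffering is the runner's doing.
            c.output(rs.getString(1));
          }
        }
      }
    }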