I agree, it would be the easiest way and allow users to switch easily as
well using a single artifact.
Regards
JB
On 29/10/2019 23:54, Kenneth Knowles wrote:
Is it just as easy to have two jars and build an uber jar with both
included? Then the runner can still be toggled with a flag.
Kenn
I would also suggest using Java's Instant, since it will be compatible with
many more date/time libraries without forcing users through an artificial
millis/nanos conversion layer.
On Tue, Oct 29, 2019 at 5:06 PM Robert Bradshaw wrote:
> On Tue, Oct 29, 2019
Thank you, Luke and Robert. Sorry for hitting dev@; I criss-crossed and meant
to hit user@, but while we're here, could you clarify your two points?
1) I am under the impression that the 4,000-sliding-windows approach (30 days
every 10 minutes) will re-evaluate my combine aggregation every 10 minutes
On Tue, Oct 29, 2019 at 4:20 PM Kenneth Knowles wrote:
>
> Point (1) is compelling. Solutions to the "minus epsilon" seem a bit complex.
> On the other hand, an opaque and abstract Timestamp type (in each SDK) going
> forward seems like a Pretty Good Idea (tm). Would you really have to go
>
(acknowledging that ADT is a language level concept and this shows up as
multiple implementations with URNs or some such in the portability layer,
so we can version and name the choices)
On Tue, Oct 29, 2019, 16:20 Kenneth Knowles wrote:
> Point (1) is compelling. Solutions to the "minus
Point (1) is compelling. Solutions to the "minus epsilon" seem a bit
complex. On the other hand, an opaque and abstract Timestamp type (in each
SDK) going forward seems like a Pretty Good Idea (tm). Would you really
have to go floating point? Could you just have a distinguished
representation for
Is it just as easy to have two jars and build an uber jar with both
included? Then the runner can still be toggled with a flag.
Kenn
On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko
wrote:
> Hmm, I don’t think that jar size should play a big role compared to the
> whole size of the shaded jar of
Post-commit runs all precommits. The builder for the Jenkins jobs creates
separate jobs with suffixes:
* _Commit (for when a commit is pushed to a PR)
* _Phrase (for when someone asks to run it)
* _Cron (run as a post-commit against master)
This way, the different jobs have independent
No matter how the problem is structured, computing 30-day aggregations
for every 10-minute window requires storing at least 30 days / 10 min ≈
4,000 sub-aggregations. In Beam, the elements themselves are not
stored in every window, only the intermediate aggregates.
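The sub-aggregation count above is plain arithmetic, which a few lines of Python make concrete (this is just the back-of-the-envelope check, not Beam code):

```python
# Number of live sliding windows = window size / slide period.
SECONDS_PER_DAY = 24 * 60 * 60

window_size = 30 * SECONDS_PER_DAY  # 30 days
period = 10 * 60                    # 10 minutes

num_sub_aggregations = window_size // period
print(num_sub_aggregations)  # 4320, i.e. the "~4,000" figure
```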
I second Luke's suggestion to try it
You should first try the obvious answer of using a sliding window of 30
days every 10 minutes before you try the 60 days every 30 days.
Beam has some optimizations which will assign a value to multiple windows
and only process that value once even if it's in many windows. If that
doesn't perform
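To make the "assign a value to multiple windows" idea concrete, here is an illustrative Python sketch of the standard sliding-window assignment rule (this is not Beam's actual implementation; Beam's optimization is that the value is processed once even though it lands in all of these windows):

```python
def assign_sliding_windows(ts, size, period):
    """Return every window [start, start + size) containing timestamp ts,
    where window starts are aligned to multiples of period."""
    last_start = ts - ts % period
    # Walk backwards by one period at a time while the window still covers ts.
    return [(start, start + size)
            for start in range(last_start, ts - size, -period)]

# A value at t=25 with 30-unit windows sliding every 10 units
# belongs to size/period = 3 windows.
windows = assign_sliding_windows(25, size=30, period=10)
```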
Hi, I am new to Beam.
I would like to accumulate data over a 30-day period and perform a running
aggregation over this data, say every 10 minutes.
I could use a sliding window of 30 days every 10 minutes (triggering at the
end of the window), but this seems grossly inefficient (both in terms of # of
windows
Hi, Beam dev community,
After seeing some difficulties in troubleshooting Python pipelines in
Dataflow, we came up with an idea to expose SDK harness status to runners
through the Fn API for better debuggability.
I put up a doc describing the background and proposed approach:
https://github.com/apache/beam/pull/9925
On Tue, Oct 29, 2019 at 10:24 AM Udi Meiri wrote:
>
> I don't have the bandwidth right now to tackle this. Feel free to take it.
>
> On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw wrote:
>>
>> The Python SDK does as well. These calls are coming from
>>
IIRC, post-commit currently doesn't run pre-commits. However, we have
precommit_cron jobs that run pre-commits periodically, and they add up
to dozens of jobs that are really hard to monitor.
If we split things even further, we definitely need to combine the results
into something more easily
I don't have the bandwidth right now to tackle this. Feel free to take it.
On Tue, Oct 29, 2019 at 10:16 AM Robert Bradshaw
wrote:
> The Python SDK does as well. These calls are coming from
> to_runner_api, is_stateful_dofn, and validate_stateful_dofn which are
> invoked once per pipeline or
> +1 for splitting pre-commit tests into smaller modules. However, in this
> case we need to run all the small tests periodically and have some combined
> flag or dashboard for regular monitoring. Otherwise we might fail to
> run/check a large number of tests.
>
post-commit seems like the best place
Based upon the current description, from the portability perspective we
could:
Update the timer spec map comment[1] to be:
// (Optional) A mapping of local timer families to timer specifications.
map<string, TimerSpec> timer_specs = 5;
And update the timer coder to have the timer id[2]:
// Encodes a timer
+1 for splitting pre-commit tests into smaller modules. However, in this
case we need to run all the small tests periodically and have some combined
flag or dashboard for regular monitoring. Otherwise we might fail to
run/check a large number of tests.
On Mon, Oct 28, 2019 at 6:39 PM Kenneth Knowles
Hmm, I don’t think that jar size should play a big role compared to the whole
size of the shaded jar of a user's job. Even more, I think it will be quite
confusing for users to choose which jar to use if we have 3 different ones for
similar purposes. Though, let’s see what others think.
> On 29
Thank you for advising, Udi and Ahmet.
I'll take a look at the release process.
On Tue, Oct 29, 2019 at 3:47 Ahmet Altay wrote:
>
> Thank you for doing this. It should be possible to run tox as Udi suggested
> and create a PR for review purposes similar to the release process (see:
>
Noting for the benefit of the thread archive in case someone goes digging
and wonders if this affects other SDKs: the Java SDK
memoizes DoFnSignatures and generated DoFnInvoker classes.
Kenn
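For readers less familiar with the Java SDK internals: the memoization Kenn mentions is the standard compute-once-and-cache pattern. A hedged Python analogue (the function name and "signature" payload here are purely illustrative, not Beam's API):

```python
from functools import lru_cache

computations = []  # records how many times the expensive analysis ran

@lru_cache(maxsize=None)
def dofn_signature(dofn_class):
    # Stand-in for an expensive reflective analysis; in the Java SDK the
    # analogous DoFnSignatures result is computed once per class and cached.
    computations.append(dofn_class)
    return f"signature({dofn_class.__name__})"

class MyDoFn:
    pass

dofn_signature(MyDoFn)
dofn_signature(MyDoFn)  # served from the cache, no recomputation
```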
On Mon, Oct 28, 2019 at 6:59 PM Udi Meiri wrote:
> Re: #9283 slowing down tests, ideas for slowness:
>
tl;dr:
- I see consensus for inferring "http://" in Python to align it with the
Java behavior, which currently requires leaving out the protocol scheme.
Optionally, Java could also accept a scheme which gets removed, as
required by the Flink Java REST client.
- We won't support "https://" in
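The normalization described in the tl;dr could look roughly like the following sketch (the helper name and the rejection of "https://" are assumptions based on the truncated note above, not Beam's actual code):

```python
def normalize_flink_master(master):
    """Accept a Flink master address with or without an "http://" scheme
    and strip the scheme, since the Flink Java REST client expects a bare
    host:port. "https://" is rejected rather than silently stripped."""
    if master.startswith("https://"):
        raise ValueError("https:// is not supported")
    if master.startswith("http://"):
        master = master[len("http://"):]
    return master
```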
Hi,
+1 for the empty string being interpreted as [auto] and anything else having
an explicit notation.
One more reason that was not part of this discussion yet: in [1] there
was a discussion about LocalEnvironment (that is the one that is
responsible for spawning an in-process Flink cluster) not
On Tue, Oct 29, 2019 at 10:04 AM Ryan Skraba wrote:
> I didn't get a chance to try this out -- it sounds like a bug with the
> SparkRunner, if you've tested it with FlinkRunner and it succeeded.
>
> From your description, it should be reproducible by reading any large
> database table with the
Hi Reuven,
I didn't propose to restrict the model. The model can (and should) have
multiple timers per key, even dynamic ones. The question was whether this
can be made efficient using a single timer (after all, the runner will
probably have a single "timer service"), so no matter what we expose on the
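The "single timer service" idea can be sketched as multiplexing many logical timers onto one physical deadline, always arming the earliest. This is a hypothetical illustration of the technique, not any runner's actual code:

```python
import heapq

class TimerMultiplexer:
    """Expose many logical timers per key on top of one underlying timer
    by keeping deadlines in a min-heap and arming only the earliest."""

    def __init__(self):
        self._heap = []  # entries are (fire_time, timer_id)

    def set_timer(self, timer_id, fire_time):
        heapq.heappush(self._heap, (fire_time, timer_id))

    def next_deadline(self):
        # The single physical timer would be armed for this instant.
        return self._heap[0][0] if self._heap else None

    def fire_due(self, now):
        # When the physical timer fires, pop every logical timer that is due.
        fired = []
        while self._heap and self._heap[0][0] <= now:
            fired.append(heapq.heappop(self._heap)[1])
        return fired

mux = TimerMultiplexer()
mux.set_timer("gc", 100)
mux.set_timer("emit", 50)
```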
I didn't get a chance to try this out -- it sounds like a bug with the
SparkRunner, if you've tested it with FlinkRunner and it succeeded.
From your description, it should be reproducible by reading any large
database table with the SparkRunner where the entire dataset is
greater than the memory
I cannot find anything in the docs about the expected behavior of a DoFn
emitting an arbitrarily large number of elements in one processElement().
I wonder if the Spark runner's behavior is a bug, or just an execution
difference (and a disadvantage in this case) that belongs in the runner
capability matrix.
Also,