Regarding resources: the runner can currently dictate the mem/cpu/disk
resources that the harness is allowed to use via the provisioning API. The
SDK harness need not -- and should not -- speculate on what else might be
running on the machine:
I am thinking of the upper bound more along the lines of a theoretical upper
limit, or some other static high value beyond which the SDK will reject
bundles verbosely. The idea is that the SDK will not keep bundles queued
while waiting on current bundles to finish; it will simply reject any
additional bundles.
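A minimal sketch of that rejection behavior (hypothetical names, not actual SDK harness code), assuming a static concurrency cap: the harness admits bundles up to the cap and fails fast instead of queueing.

```python
import threading

class BundleGate:
    """Admit at most `limit` concurrent bundles; reject extras verbosely."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def try_start(self, bundle_id):
        # Non-blocking acquire: never queue a bundle behind running ones.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(
                "bundle %s rejected: static concurrency limit reached"
                % bundle_id)
        return True

    def finish(self):
        self._slots.release()
```

With a limit of 2, a third concurrent `try_start` raises instead of queueing, and the runner is expected to retry or redirect the bundle.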
Ankur, how would you expect an SDK to compute a realistic upper bound
(upfront or during pipeline computation)?
The first thought that came to my mind was that the SDK would provide
CPU/memory/... resourcing information, with the runner making a judgement
call as to whether it should ask the SDK to do more work.
I agree with Luke's observation, with the caveat that "infinite amount of
bundles in parallel" is limited by the available resources. For example,
the Go SDK harness will accept an arbitrary amount of parallel work, but
too much work will cause excessive GC pressure and crippling slowness.
I forwarded your request to a few people who work on the internal parts of
Dataflow to see if they could help in some way.
On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot
wrote:
> Hi all
>
> As we already discussed, it would be good to support Metrics Pusher [1] in
> Dataflow (in other runners
The latter case, where the SDK supports only single-bundle execution at a
time and the runner does not use this flag, is exactly the reason we got
into the deadlock here.
I agree with exposing the SDK's optimum concurrency level (1 in the latter
case) and letting the runner decide whether to use it. But at the same time,
I believe in practice SDK harnesses will fall into one of two capability
classes: they can process an effectively infinite number of bundles in
parallel, or they can only process a single bundle at a time.
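One way a runner could act on those two capability classes (a sketch assuming the harness advertises its parallelism; none of these names are real Beam APIs): cap the number of in-flight bundles at the advertised level, so the runner does the waiting, never the harness.

```python
import threading

def dispatch(bundles, harness_parallelism, process_bundle):
    """Runner-side sketch: cap in-flight bundles at the harness's advertised
    parallelism (1 for a single-bundle harness, a large value otherwise)."""
    in_flight = threading.Semaphore(harness_parallelism)
    results = []
    threads = []

    def run(bundle):
        try:
            results.append(process_bundle(bundle))
        finally:
            in_flight.release()

    for bundle in bundles:
        in_flight.acquire()  # the runner blocks here, not the harness
        t = threading.Thread(target=run, args=(bundle,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```

With `harness_parallelism=1` this degenerates to strictly sequential hand-off, which is the single-bundle case the thread is worried about.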
I believe it is more difficult for a runner to handle the latter case well
and to perform all the
To recap the discussion, it seems we have come up with the following points.
SDKHarness management and initialization:
1. The runner completely owns work assignment to the SDKHarness.
2. The runner should know the capabilities and capacity of the SDKHarness
and should assign work accordingly.
3.
Finding a good balance is indeed the art of portability, because the range
of capability (and assumptions) on both sides is wide.
The original idea was to allow the SDK harness to be an extremely simple
bundle executor (specifically, single-threaded execution of one instruction
at a time).
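That original idea can be sketched in a few lines (a hypothetical shape, not the actual Fn API): the harness is just a loop that pulls one instruction at a time and executes it synchronously.

```python
def run_harness(instruction_source, handlers):
    # Single-threaded bundle executor: one instruction at a time, in order.
    for instruction in instruction_source:
        handler = handlers[instruction["type"]]
        handler(instruction)  # the next instruction waits for this one
```

Everything beyond this, parallelism, backpressure, capacity reporting, is added complexity that the thread is debating where to place.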
Thanks Robert. You raise a good point that this code is
performance-critical.
If the check can be fast, then it's worth having.
> Should we also let Beam error out if users return a string?
> > e.g. consider the following pipeline:
> > p | Create(['abc']) | ParDo(lambda x: x) |
Beam Python expects DoFns to return an iterable that contains the actual
output elements. This is documented, and visible in examples, but it is
also a bit counter-intuitive.
We should definitely add a check in _OutputProcessor[1] to throw a more
expressive error if it receives a non-iterable.
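A sketch of what such a guard could look like (hypothetical helper; not Beam's actual `_OutputProcessor` code). Strings need special-casing because they are iterable, so `return 'abc'` would otherwise silently emit the elements 'a', 'b', 'c'.

```python
def check_dofn_output(result):
    # Hypothetical guard; not Beam's actual _OutputProcessor code.
    if result is None:
        return []  # a DoFn that returns nothing emits no elements
    if isinstance(result, (str, bytes)):
        raise TypeError(
            "DoFn returned a %s; wrap single elements in a list, "
            "e.g. `return [x]`, or use `yield`" % type(result).__name__)
    try:
        return iter(result)
    except TypeError:
        raise TypeError(
            "DoFn must return an iterable of output elements, got %s"
            % type(result).__name__)
```

This catches both the string case from the example pipeline above and plain non-iterables, while passing lists and generators through untouched.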
Thanks for the invite, and congrats for the event.
By the way, I will share my slides with you (sorry I forgot).
Regards
JB
On 17/08/2018 19:36, Arianne Lisset Navarro Lepe wrote:
> Great event, great results =) .. We are preparing a blog about the Open
> Source Challenge and will share with
Great event, great results =) We are preparing a blog about the Open Source
Challenge and will share it with you in the upcoming days.
Thanks @Gris for coming to Guadalajara to support the two Beam teams
participating in the challenge, and for your support with the presentation
and the event overall.
It is good to see this discussed!
I think there needs to be a good balance between the SDK harness
capabilities/complexity and responsibilities. Additionally the user will
need to be able to adjust the runner behavior, since the type of workload
executed in the harness also is a factor. Elsewhere
Hello everyone,
this has finished today. The two teams that worked in Beam prepared 2 PRs
(6241, 6242), made one code change in a branch without a PR, and proposed
solutions for a couple more issues in JIRA.
These PRs and changes will have to be iterated on, but we'll be able to
incorporate their
Thank you!
On Fri, Aug 17, 2018 at 9:44 AM Thomas Weise wrote:
> Anton, you should be all set.
>
> On Fri, Aug 17, 2018 at 9:11 AM Anton Kedin wrote:
>
>> Sure, I can do that.
>> Can someone give me permissions?
>>
>> Thank you,
>> Anton
>>
>> On Fri, Aug 17, 2018 at 12:32 AM Etienne Chauchot
Anton, you should be all set.
On Fri, Aug 17, 2018 at 9:11 AM Anton Kedin wrote:
> Sure, I can do that.
> Can someone give me permissions?
>
> Thank you,
> Anton
>
> On Fri, Aug 17, 2018 at 12:32 AM Etienne Chauchot
> wrote:
>
>> Hi Anton,
>>
>> I was hoping you would say that. Actually I
Sure, I can do that.
Can someone give me permissions?
Thank you,
Anton
On Fri, Aug 17, 2018 at 12:32 AM Etienne Chauchot
wrote:
> Hi Anton,
>
> I was hoping you would say that. Actually I hesitated to add SQL-Nexmark
> and I thought you were more suited to describe it :)
>
> Thanks
> Etienne
>
Most of our documentation talks about the execution model and concepts like
PTransform/PCollection/...[1]
There are guides for contributors related to style/building a
runner/existing design docs[2]
To my knowledge there are no project structure docs.
1:
Have a good time.
On Fri, Aug 17, 2018 at 2:15 AM Łukasz Gajowy
wrote:
> Hi all,
>
> I just wanted to tell that I'll be off on vacation for 2 weeks starting on
> Monday.
>
> Łukasz
>
SDK harnesses have always been responsible for executing all work given to
them concurrently. Runners have been responsible for choosing how much work
to give to the SDK harness in a way that best utilizes it.
I understand that multithreading in Python is inefficient due to the global
interpreter lock (GIL).
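To illustrate the GIL point with a generic sketch (not Beam code; `cpu_bound` is a hypothetical stand-in for a CPU-heavy DoFn body): threads in one Python process serialize CPU-bound work, which is why runners typically scale Python SDK work across multiple harness processes rather than threads.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n):
    # Stand-in for a CPU-heavy DoFn body.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_bundles(executor_cls, bundles, workers=4):
    # With ThreadPoolExecutor the GIL serializes cpu_bound calls; with
    # ProcessPoolExecutor (or separate harness processes) they can run
    # in parallel on multiple cores.
    with executor_cls(max_workers=workers) as ex:
        return list(ex.map(cpu_bound, bundles))
```

Both executors produce the same results; only the wall-clock behavior differs, and only for CPU-bound work (I/O-bound DoFns release the GIL while waiting).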
Hi Ankur,
Thanks for looking into this problem. The cause seems to be Flink's
pipelined execution mode. It runs multiple tasks in one task slot and
produces a deadlock when the pipelined operators schedule the SDK
harness DoFns in non-topological order.
The problem would be resolved if we
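The topological-order point can be made concrete with a generic sketch (not Flink code): launching operators in topological order guarantees every producer is running before its consumers, so no consumer can wait on an unscheduled upstream and close a circular wait.

```python
from collections import deque

def topo_schedule(ops, edges):
    """Return a launch order where every producer precedes its consumers.
    ops: iterable of operator names; edges: (producer, consumer) pairs."""
    consumers = {op: [] for op in ops}
    indegree = {op: 0 for op in ops}
    for producer, consumer in edges:
        consumers[producer].append(consumer)
        indegree[consumer] += 1
    ready = deque(op for op in ops if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for c in consumers[op]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(indegree):
        raise ValueError("cycle detected; no deadlock-free schedule exists")
    return order
```

For a linear read → pardo → write pipeline this yields exactly the pipeline order; a cyclic wait graph is reported instead of silently deadlocking.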
@Mikhail I must disagree. I have set the Nexmark tests as PostCommit mainly
because they are used to discover regressions across the whole Beam model as
soon as possible: when a regression-introducing commit lands in master. The
graphs (see https://beam.apache.org/documentation/sdks/java/nexmark/) help a
Hi Jozef,
There is a detailed discussion in the runner-agnostic metrics PR about
metrics in Flink in detached mode. Here it is:
https://github.com/apache/beam/pull/4548. See the discussion I had with
Zorro786.
Best
Etienne
On Wednesday, August 1, 2018 at 16:45 +0200, Jozef Vilcek wrote:
> Hello,
> I
Filed one: https://issues.apache.org/jira/browse/BEAM-5162
On 17.08.18 09:30, Etienne Chauchot wrote:
> Hi,
> I did not plan it but we could. Indeed, When I did the wiki page I
> searched for user doc and I found only the javadoc and the design doc
> from Ben Chambers. Maybe a page on the regular
Hi,
I did not plan it but we could. Indeed, when I did the wiki page I searched
for user doc and I found only the javadoc and the design doc from Ben
Chambers. Maybe a page on the regular website would be needed. You're right,
please file a JIRA.
Etienne
On Thursday, August 16, 2018 at 18:24 +0200,