On Wed, Jul 20, 2022 at 12:57 PM Chamikara Jayalath <[email protected]> wrote:
> I don't think it's an antipattern per se. You can implement arbitrary
> operations in a DoFn or an SDF to read data.
>
> But if a single resource ID maps to a large amount of data, Beam runners
> (including Dataflow) will not be able to parallelize reading, hence your
> solution may have suboptimal performance compared to reading from a Beam
> source that can be fully parallelized.
>
> Thanks,
> Cham
>
> On Wed, Jul 20, 2022 at 11:53 AM Shree Tanna <[email protected]> wrote:
>
>> Thank you! I will try this out.
>>
>> One more question on this: is it considered an anti-pattern to do HTTP
>> ingestion on GCP Dataflow for the reasons I mentioned in my original
>> message? I ask because I am getting that indication from some of my
>> co-workers and also from Google Cloud support. I'm not sure if this is
>> the right place to ask this question; happy to move this conversation
>> somewhere else if not.
>>
>> On Tue, Jul 19, 2022 at 5:18 PM Luke Cwik via user <[email protected]> wrote:
>>
>>> Even if you don't have the resource ids ahead of time, you can have a
>>> pipeline like:
>>>
>>> Impulse -> ParDo(GenerateResourceIds) -> Reshuffle -> ParDo(ReadResourceIds) -> ...
>>>
>>> You could also compose these as splittable DoFns [1, 2, 3]:
>>>
>>> ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)
>>>
>>> The first approach is the simplest: the Reshuffle rebalances the
>>> reading of each resource id across worker nodes, but generation of the
>>> resource ids is limited to one worker. Making the generation a
>>> splittable DoFn means you can increase the parallelism of generation,
>>> which is important if there are so many ids that generating them could
>>> crash a worker or fail to have the output committed (these kinds of
>>> failures are runner-dependent, since runners differ in how well they
>>> handle single bundles with large outputs).
>>> Making the reading splittable allows you to handle a large resource
>>> (imagine a large file) so that it can be read and processed in
>>> parallel (with similar failures if the runner can't handle single
>>> bundles with large outputs).
>>>
>>> You can always start with the first solution and swap either piece for
>>> a splittable DoFn depending on your performance requirements and how
>>> well the simple solution works.
>>>
>>> 1: https://beam.apache.org/blog/splittable-do-fn/
>>> 2: https://beam.apache.org/blog/splittable-do-fn-is-available/
>>> 3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns
>>>
>>> On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan <[email protected]> wrote:
>>>
>>>> Provided you have all the resource ids ahead of fetching, Beam will
>>>> spread the fetches across its workers. Each fetch will still be
>>>> synchronous, but only within that worker.
>>>>
>>>> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm planning to use Apache Beam for the extract and load parts of an
>>>>> ETL pipeline and to run the jobs on Dataflow. I will have to do REST
>>>>> API ingestion on our platform. I can opt to make synchronous API
>>>>> calls from a DoFn, but with that the pipeline will stall while REST
>>>>> requests are made over the network.
>>>>>
>>>>> Is it best practice to run a REST ingestion job on Dataflow? Is
>>>>> there any best practice I can follow to accomplish this? Just as a
>>>>> reference I'm adding this StackOverflow thread
>>>>> <https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java>
>>>>> here too. Also, I notice that a REST I/O transform
>>>>> <https://beam.apache.org/documentation/io/built-in/> built-in
>>>>> connector is in progress for Java.
>>>>>
>>>>> Let me know if this is the right group to ask this question. I can
>>>>> also ask [email protected] if needed.
>>>>> --
>>>>> Thanks,
>>>>> Shree
>>>>>
>>
>> --
>> Best,
>> Shree
