Re: RedisIO Apache Beam JAVA Connector

2022-07-19 Thread Shivam Singhal
Hi Alexey!

Thanks for replying.
I think we will only use RedisIO to write to Redis. From your reply and
GitHub issue 21825, it seems SDF is causing some issues when reading from
Redis.

Do you know of any issues with Write?
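
For context, our write path would look roughly like this (a minimal
sketch; the endpoint and keys are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.redis.RedisIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class RedisWriteExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // RedisIO.write() consumes KV<String, String> pairs (key, value).
    p.apply(Create.of(KV.of("user:1", "alice"), KV.of("user:2", "bob")))
        .apply(RedisIO.write()
            .withEndpoint("localhost", 6379)        // placeholder endpoint
            .withMethod(RedisIO.Write.Method.SET)); // APPEND, LPUSH, SADD, ... also exist

    p.run().waitUntilFinish();
  }
}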

If I get a chance to test the reading in my staging environment, I will :)

Thanks,
Shivam Singhal

On Mon, 18 Jul 2022 at 22:22, Alexey Romanenko 
wrote:

> Hi Shivam,
>
> RedisIO has been in Beam for quite a long time, so we may consider it
> rather stable. I guess it was marked @Experimental since its user API was
> still changing at that point [1].
>
> However, RedisIO recently moved to SDF for its reading part, so I can’t
> say how heavily it has been tested in production systems. AFAICT, there is
> an open issue [2] that is likely related to this.
>
> It would be great if you could test this IO in your testing environment
> and provide some feedback on how it works for your cases.
>
> —
> Alexey
>
> [1] https://issues.apache.org/jira/browse/BEAM-9231
> [2] https://github.com/apache/beam/issues/21825
>
>
> On 18 Jul 2022, at 02:19, Shivam Singhal 
> wrote:
>
> Hi everyone,
>
> I see that from org.apache.beam.sdk.io.redis version 2.20.0 onwards, this
> connector is marked experimental.
>
> I tried to see the changelog for v2.20.0 but could not find an
> explanation.
>
> I am working with Apache Beam 2.40.0 and wanted to know which classes and
> functions are marked experimental in *org.apache.beam.sdk.io.redis:2.40.0*?
> Is it safe to use in production environments?
>
> Thanks!
>


Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

2022-07-19 Thread Luke Cwik via user
Even if you don't have the resource ids ahead of time, you can have a
pipeline like:
Impulse -> ParDo(GenerateResourceIds) -> Reshuffle ->
ParDo(ReadResourceIds) -> ...
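
For example, in Java that could look like the following (a minimal sketch;
GenerateResourceIds, ReadResourceIds, and fetchResource are hypothetical
stand-ins for your own listing and fetching logic):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Impulse;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

public class ResourcePipeline {
  // Hypothetical: emits one resource id per downstream read.
  static class GenerateResourceIds extends DoFn<byte[], String> {
    @ProcessElement
    public void process(OutputReceiver<String> out) {
      for (int i = 0; i < 1000; i++) {
        out.output("resource-" + i); // e.g. listed from an external API
      }
    }
  }

  // Hypothetical: fetches the content of one resource id.
  static class ReadResourceIds extends DoFn<String, String> {
    @ProcessElement
    public void process(@Element String id, OutputReceiver<String> out) {
      out.output(fetchResource(id)); // your HTTP call goes here
    }
  }

  static String fetchResource(String id) { return id; /* placeholder */ }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(Impulse.create())                       // single impulse element
        .apply(ParDo.of(new GenerateResourceIds())) // ids generated on one worker
        .apply(Reshuffle.viaRandomKey())            // rebalance across workers
        .apply(ParDo.of(new ReadResourceIds()));    // fetch in parallel
    p.run().waitUntilFinish();
  }
}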

You could also compose these as splittable DoFns [1, 2, 3]:
ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)

The first approach is the simplest: the reshuffle rebalances the reading of
each resource id across worker nodes, but it is limited to generating the
resource ids on a single worker. Making the generation a splittable DoFn
means you can increase the parallelism of generation, which is important if
there are so many ids that they could crash a worker or fail to have the
output committed (whether these failures occur depends on how well the
runner handles single bundles with large outputs). Making the reading
splittable allows you to handle a large resource (imagine a large file) so
that it can be read and processed in parallel (with similar failures if the
runner can't handle single bundles with large outputs).
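
A minimal sketch of what the reading side could look like as a splittable
DoFn (assuming, hypothetically, that a resource is addressable as numbered
records; countRecords and readRecord are stand-ins for your own logic):

import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

@DoFn.BoundedPerElement
class SplittableReadResourceIds extends DoFn<String, String> {

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String resourceId) {
    // One restriction spanning all records of this resource.
    return new OffsetRange(0, countRecords(resourceId));
  }

  @ProcessElement
  public void process(
      @Element String resourceId,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    // Claim each offset before producing it; the runner may split the
    // remainder of the range onto another worker at any time.
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
      out.output(readRecord(resourceId, i));
    }
  }

  private static long countRecords(String resourceId) { return 0; /* placeholder */ }
  private static String readRecord(String resourceId, long offset) { return ""; /* placeholder */ }
}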

You can always start with the first solution and swap either piece to be a
splittable DoFn depending on your performance requirements and how well the
simple solution works.

1: https://beam.apache.org/blog/splittable-do-fn/
2: https://beam.apache.org/blog/splittable-do-fn-is-available/
3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns


On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan 
wrote:

> Provided you have all the resource ids ahead of fetching, Beam will
> spread the fetches across its workers. Each worker will still fetch
> synchronously.
>
> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna  wrote:
>
>> Hi all,
>>
>> I'm planning to use Apache Beam to extract and load part of the ETL
>> pipeline and run the jobs on Dataflow. I will have to do the REST API
>> ingestion on our platform. I can opt to make synchronous API calls from a
>> DoFn, but with that, pipelines will stall while REST requests are made
>> over the network.
>>
>> Is it best practice to run the REST ingestion job on Dataflow? Is there
>> any best practice I can follow to accomplish this? Just as a reference,
>> I'm adding this StackOverflow thread here too. Also, I notice that the
>> Rest I/O transform built-in connector is in progress for Java.
>>
>> Let me know if this is the right group to ask this question. I can also
>> ask d...@beam.apache.org if needed.
>> --
>> Thanks,
>> Shree
>>
>


Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

2022-07-19 Thread Damian Akpan
Provided you have all the resource ids ahead of fetching, Beam will spread
the fetches across its workers. Each worker will still fetch synchronously.

On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna  wrote:

> Hi all,
>
> I'm planning to use Apache Beam to extract and load part of the ETL
> pipeline and run the jobs on Dataflow. I will have to do the REST API
> ingestion on our platform. I can opt to make synchronous API calls from a
> DoFn, but with that, pipelines will stall while REST requests are made
> over the network.
>
> Is it best practice to run the REST ingestion job on Dataflow? Is there
> any best practice I can follow to accomplish this? Just as a reference,
> I'm adding this StackOverflow thread here too. Also, I notice that the
> Rest I/O transform built-in connector is in progress for Java.
>
> Let me know if this is the right group to ask this question. I can also
> ask d...@beam.apache.org if needed.
> --
> Thanks,
> Shree
>


[Dataflow][Python] Guidance on HTTP ingestion on Dataflow

2022-07-19 Thread Shree Tanna
Hi all,

I'm planning to use Apache Beam to extract and load part of the ETL
pipeline and run the jobs on Dataflow. I will have to do the REST API
ingestion on our platform. I can opt to make synchronous API calls from a
DoFn, but with that, pipelines will stall while REST requests are made over
the network.
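
For concreteness, the kind of synchronous call I mean looks roughly like
this (sketched in Java; the endpoint and response handling are
placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: one blocking REST call per element. The bundle stalls for the
// full round trip of each request.
class FetchFn extends DoFn<String, String> {
  private transient HttpClient client;

  @Setup
  public void setup() {
    client = HttpClient.newHttpClient(); // one client per DoFn instance
  }

  @ProcessElement
  public void process(@Element String id, OutputReceiver<String> out)
      throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.example.com/items/" + id)) // placeholder
        .GET()
        .build();
    // Synchronous send: the worker thread blocks until the response arrives.
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    out.output(response.body());
  }
}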

Is it best practice to run the REST ingestion job on Dataflow? Is there any
best practice I can follow to accomplish this? Just as a reference, I'm
adding this StackOverflow thread here too. Also, I notice that the Rest I/O
transform built-in connector is in progress for Java.

Let me know if this is the right group to ask this question. I can also ask
d...@beam.apache.org if needed.
-- 
Thanks,
Shree