Re: [Question] Apache Beam Pipeline AWS Credentials

2024-01-10 Thread Alexey Romanenko
Hi Ramya, I don’t think there is an out-of-the-box solution for such a case, but I believe you can create your own AwsCredentialsProvider and implement the logic to fetch the credentials dynamically. PS: Please, use only the user@beam.apache.org mailing list for such
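The custom provider suggested above could look roughly like this, for the AWS SDK v2 interface that the current Beam AWS IOs use. This is a minimal sketch only; `fetchFromSecretStore()` and its two-element return value are hypothetical placeholders for your own lookup logic:

```java
import java.io.Serializable;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;

// Re-resolves short-lived credentials on every call. Serializable so that
// Beam can ship the provider to workers.
public class DynamicCredentialsProvider implements AwsCredentialsProvider, Serializable {

  @Override
  public AwsCredentials resolveCredentials() {
    // Hypothetical dynamic fetch (Vault, Secrets Manager, an internal service, ...).
    String[] kv = fetchFromSecretStore();
    return AwsBasicCredentials.create(kv[0], kv[1]);
  }

  // Placeholder; replace with your own implementation.
  private String[] fetchFromSecretStore() {
    throw new UnsupportedOperationException("wire up your secret store here");
  }
}
```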

Re: Why do we need Job Server?

2023-12-05 Thread Alexey Romanenko
n just uploaded as an > "ordinary" flink executable jar to the Flink master: > https://github.com/apache/beam/blob/release-2.52.0/sdks/python/apache_beam/runners/portability/abstract_job_service.py#L301 > If Java doesn't do this yet we should probably update it to do so.

Re: Control who can submit Beam jobs

2023-11-30 Thread Alexey Romanenko
No, since Beam is not a runtime. In the end, it will create a Flink job and run it on a Flink cluster. So, it should be the responsibility of your Flink cluster. — Alexey > On 30 Nov 2023, at 10:14, Поротиков Станислав Вячеславович via user > wrote: > > Hello! > Is there any way to control who

Re: [Question] Apache Beam Spark Runner Support - Spark 3.5 Environment

2023-11-09 Thread Alexey Romanenko
eam 2.53.0 > release. > https://github.com/apache/beam/milestone/17 > > Thanks, > Giridhar. > > On Tue, Nov 7, 2023 at 6:24 PM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: >> Hi Giridhar, >> >>> On 4 Nov 2023, at 08:04, Giridhar

Re: Count(distinct) not working in beam sql

2023-11-03 Thread Alexey Romanenko
Unfortunately, Beam SQL doesn’t support the COUNT(DISTINCT) aggregation. More details about the “why” are in this discussion [1], and the related open issue is here [2]. — Alexey [1] https://lists.apache.org/thread/hvmy6d5dls3m8xcnf74hfmy1xxfgj2xh [2] https://github.com/apache/beam/issues/19398

Re: [Question] Apache Beam Spark Runner Support - Spark 3.5 Environment

2023-11-03 Thread Alexey Romanenko
AFAICT, the latest tested (compatibility tests) version for now is 3.4.1 [1]. We may try to add a 3.5.x version there. I believe that ValidatesRunner tests are run only against the default Spark 3.2.2 version. — Alexey [1]

Re: [DISCUSS] Drop Euphoria extension

2023-10-16 Thread Alexey Romanenko
Can we just deprecate it for a while and then remove it completely? — Alexey > On 13 Oct 2023, at 18:59, Jan Lukavský wrote: > > Hi, > > it has been some time since the Euphoria extension [1] was adopted by Beam > as a possible "Java 8 API". Beam has evolved a lot from that time, the >

Re: Cannot find a matching Beam FieldType for Calcite type: REAL

2023-10-13 Thread Alexey Romanenko
Seems like Calcite decided to use REAL for Float values in the SQL transform, while Beam SQL (IINM) doesn’t have a conversion from Sql.REAL to any Beam schema field type. A workaround could be to add such a conversion (REAL -> FLOAT) into CalciteUtils.java — Alexey > On 12 Oct 2023, at 20:19,

Re: [Question] Does SnowflakeIO Connector in Java support Flink?

2023-10-13 Thread Alexey Romanenko
The rule of thumb in Beam is that all IO connectors (and other transforms) written with a given SDK are supported by default by all runners for that SDK (thanks to the Beam model). Since both SnowflakeIO and FlinkRunner are written natively in Java, the answer to your question is YES: SnowflakeIO should be

Re: "Decorator" pattern for PTransforms

2023-09-18 Thread Alexey Romanenko
nsforms and of the distributed nature of >>>>>> PTransforms. I'm not suggesting limiting the entire set and my example >>>>>> was more illustrative than the actual use case. >>>>>> >>>>>> My actual use case is basically: I ha

Re: "Decorator" pattern for PTransforms

2023-09-15 Thread Alexey Romanenko
I don’t think it’s possible to extend it in the way you are asking (like Java classes “extend”). However, you can create your own composite PTransform that incorporates one or several others inside its “expand()” method. Actually, most of the Beam native PTransforms are composite transforms.
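The composite-transform pattern described above can be sketched as follows; the transform name and the two wrapped steps are illustrative only, not from the original thread:

```java
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// A composite PTransform: expand() wires existing transforms together,
// and callers apply it as a single step.
public class CleanAndUppercase
    extends PTransform<PCollection<String>, PCollection<String>> {

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    return input
        .apply("DropEmpty", Filter.by(s -> !s.isEmpty()))
        .apply("Uppercase",
            MapElements.into(TypeDescriptors.strings()).via(s -> s.toUpperCase()));
  }
}
```

A pipeline would then use it as `lines.apply(new CleanAndUppercase())`, exactly like a built-in transform.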

Re: Can we use RedisIO to write records from an unbounded collection

2023-07-21 Thread Alexey Romanenko
Hi Sachin, > On 21 Jul 2023, at 08:45, Sachin Mittal wrote: > > I was reading up on this IO here > https://beam.apache.org/documentation/io/connectors/ and it states that it > only supports batch and not streaming. I believe that statement refers only to Reading support. For Writing, it mostly

Re: [Question] Processing chunks of data in batch based pipelines

2023-07-18 Thread Alexey Romanenko
erstanding correct? So it will eventually read the > entire dataset, loading it into memory. > > I haven't tried the 2nd option you suggested. Will try it out. > > Thank you > > On Mon, Jul 17, 2023 at 10:08 PM Alexey Romanenko <mailto:aromanenko@gmail.com>>

Re: [Question] Processing chunks of data in batch based pipelines

2023-07-17 Thread Alexey Romanenko
Hi Yomal, Actually, all data in a Beam pipeline is usually processed in bundles (or chunks) when it is processed by a DoFn. The size of a bundle is up to your processing engine and, IIRC, there is no way in Beam to change it. Talking about your case: did you try to change the fetch size for Beam’s

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Alexey Romanenko
> On 23 Jun 2023, at 17:40, Robert Bradshaw via user > wrote: > > On Fri, Jun 23, 2023, 7:37 AM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: >> If the Beam Runner Authoring Guide is rather high-level for you, then, at first, >> I’d suggest to

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Alexey Romanenko
If the Beam Runner Authoring Guide is rather high-level for you, then, at first, I’d suggest answering two questions for yourself: - Am I going to implement a portable runner or a native one? - Which SDK should I use for this runner? Then, depending on the answers, I’d suggest taking as an example one of

Re: Beam SQL found limitations

2023-05-22 Thread Alexey Romanenko
Hi Piotr, Thanks for the details! I cross-post this to dev@ as well since, I guess, people there can provide more insights on this. A while ago, I faced similar issues trying to run Beam SQL against the TPC-DS benchmark. We had a discussion around that [1], please take a look since it can be

Re: [java] Trouble with gradle and using ParquetIO

2023-04-26 Thread Alexey Romanenko
-io-parquet:${beamVersion}" >> ... >> >> Not sure why the CoderRegistry error comes up at runtime when both of the >> above deps are included. >> >> [1] >> https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-parquet/2.4

Re: [java] Trouble with gradle and using ParquetIO

2023-04-21 Thread Alexey Romanenko
Just curious: where was it documented like this? I briefly checked on Maven Central [1] and the provided code snippet for Gradle uses the “implementation” scope. — Alexey [1] https://search.maven.org/artifact/org.apache.beam/beam-sdks-java-io-parquet/2.46.0/jar > On 21 Apr 2023, at 01:52,

Re: Q: Apache Beam IOElasticsearchIO.read() method (Java), which expects a PBegin input and a means to handle a collection of queries

2023-04-20 Thread Alexey Romanenko
Some Java IO-connectors implement a class something like "class ReadAll extends PTransform<PCollection<Read>, PCollection<Result>>", where “Read” is supposed to be configured dynamically. As a simple example, take a look at “SolrIO” [1]. So, to support what you are looking for, the “ReadAll” pattern should be implemented

Re: ClassCastException when I have bytes type in row

2023-04-18 Thread Alexey Romanenko
Hi Jeff, Sorry for the delayed answer. Could you give more details (e.g. your pipeline and SQL query code snippets) about how to reproduce this issue? Which Beam version do you use? — Alexey > On 30 Mar 2023, at 03:21, Jeff Zhang wrote: > > > Hi, folks, > > I asked this question in

Re: major reduction is performance when using schema registry - KafkaIO

2023-04-13 Thread Alexey Romanenko
can assist > in anyway, > > Thanks > Sigalit > > On Wed, Apr 12, 2023 at 8:00 PM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: >> Mine was the similar but >> "org.apache.beam.sdk.io.

Re: major reduction is performance when using schema registry - KafkaIO

2023-04-12 Thread Alexey Romanenko
afka schemas in the next couple > of quarters, so I'll keep this case in mind while working on that. > > John > > On Tue, Apr 11, 2023 at 10:29 AM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: >> I don’t have an exact answer why it’s so much slower for now (

Re: major reduction is performance when using schema registry - KafkaIO

2023-04-11 Thread Alexey Romanenko
I don’t have an exact answer why it’s so much slower for now (only some guesses, but it requires some profiling), though could you try to test the same Kafka read but with “ConfluentSchemaRegistryDeserializerProvider” instead of KafkaAvroDeserializer and AvroCoder? More details and an example
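A read configured that way might look like the sketch below. The bootstrap servers, topic name, and registry subject (here following the common `<topic>-value` convention) are placeholder assumptions:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.kafka.ConfluentSchemaRegistryDeserializerProvider;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.LongDeserializer;

public class SchemaRegistryRead {
  static KafkaIO.Read<Long, GenericRecord> buildRead() {
    String registryUrl = "http://my-registry:8081"; // placeholder
    return KafkaIO.<Long, GenericRecord>read()
        .withBootstrapServers("kafka:9092")         // placeholder
        .withTopic("my-topic")                      // placeholder
        .withKeyDeserializer(LongDeserializer.class)
        // Resolves the writer schema from the registry and infers a coder,
        // instead of wiring KafkaAvroDeserializer + AvroCoder by hand.
        .withValueDeserializer(
            ConfluentSchemaRegistryDeserializerProvider.of(registryUrl, "my-topic-value"));
  }
}
```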

Re: Recommended way to set coder for JdbcIO with Apache Beam

2023-03-27 Thread Alexey Romanenko
eadAll() > transform. Provide a coder via withCoder, or ensure that one can be inferred > from the provided RowMapper. > > > > > Juan Cuzmar. > > --- Original Message --- > On Thursday, March 23rd, 2023 at 2:17 PM, Alexey Romanenko > wrote: > >

Re: Recommended way to set coder for JdbcIO with Apache Beam

2023-03-23 Thread Alexey Romanenko
>public class InboundData implements Serializable { > >String payload; >Map attributes; >List stores; >} > > > --- Original Message --- > On Thursday, March 23rd, 2023 at 8:14 AM, Alexey Romanenko > wrote: > > >> Could

Re: Recommended way to set coder for JdbcIO with Apache Beam

2023-03-23 Thread Alexey Romanenko
Could you share a class declaration of your InboundData class? Is it just a POJO? — Alexey > On 23 Mar 2023, at 08:16, Juan Cuzmar wrote: > > Hello all, > > I hope this message finds you well. I am currently working with Apache Beam's > JdbcIO and need some guidance regarding setting a

Re: Why is FlatMap different from composing Flatten and Map?

2023-03-15 Thread Alexey Romanenko
+CC people who might give more details on this. — Alexey > On 13 Mar 2023, at 19:32, Godefroy Clair wrote: > > Hi, > I am wondering about the way `Flatten()` and `FlatMap()` are implemented in > Apache Beam Python. > In most functional languages, FlatMap() is the same as composing `Flatten()`

Re: Backup event from Kafka to S3 in parquet format every minute

2023-02-17 Thread Alexey Romanenko
Piotr, > On 17 Feb 2023, at 09:48, Wiśniowski Piotr > wrote: > Does this mean that Parquet IO does not support partitioning, and we need to > do some workarounds? Like explicitly mapping each window to a separate > Parquet file? > Could you elaborate a bit more on this? IIRC, we used to

Re: Beam CassandraIO

2023-02-02 Thread Alexey Romanenko
- d...@beam.apache.org + user@beam.apache.org Hi Enzo, Can you make sure that all your workers were properly added and listed in the Spark WebUI? Did you specify a “--master spark://HOST:PORT” option while running your Beam job with the SparkRunner? PS: Please, use the user@beam.apache.org mailing

Re: KafkaIo Metrics

2023-01-20 Thread Alexey Romanenko
IIRC, we don’t expose any Kafka Consumer metrics, so I’m afraid there is no easy way to get them in a Beam pipeline. — Alexey > On 18 Jan 2023, at 21:43, Lydian wrote: > > Hi, > I know that Beam KafkaIO doesn't use the native kafka offset, and therefore I > cannot use kafka metrics

Re: [Question] Trino Connector

2022-12-26 Thread Alexey Romanenko
Hi Ammar, I don’t think we have a dedicated connector for Trino. I also didn’t find any open Beam GitHub issue for this. On the other hand, it appears that Trino provides a JDBC client driver [1], so you may want to try to use Beam JdbcIO connector [2] in this case. [1]

Re: Beam, Flink state and Avro Schema Evolution is problematic

2022-11-23 Thread Alexey Romanenko
+ dev Many thanks for sharing your observations and findings on this topic, Cristian! I copy it to dev@ as well to attract more attention to this problem. — Alexey > On 18 Nov 2022, at 18:21, Cristian Constantinescu wrote: > > Hi everyone, > > I'm using Beam on Flink with Avro generated

Re: Reading from AWS Kinesis Stream Cross account

2022-11-14 Thread Alexey Romanenko
If I’m not mistaken, it’s not supported in the current implementation of KinesisIO. PS: Btw, if you still use the KinesisIO connector based on AWS SDK v1 [1], it’s highly recommended to switch to the one based on AWS SDK v2 [2] since the former is deprecated. [1]

Re: Request to suggest alternative approaches for side input use cases in apache beam

2022-10-26 Thread Alexey Romanenko
Well, it depends on how you use the Redis cache and how often it changes. For example, if you need to query the cache for a group of input records, then you can group them into batches and make only one remote call to the cache per batch, as explained here [1]. In any case,
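The batching idea above can be sketched with Beam's GroupIntoBatches; the key type, batch size of 100, and the Redis call are illustrative assumptions, not from the original thread:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BatchedEnrichment {
  static PCollection<String> enrich(PCollection<KV<String, String>> keyed) {
    return keyed
        // Buffer up to 100 elements per key before emitting a batch.
        .apply(GroupIntoBatches.<String, String>ofSize(100))
        .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
          @ProcessElement
          public void process(@Element KV<String, Iterable<String>> batch,
                              OutputReceiver<String> out) {
            // One round trip to the cache (e.g. a Redis MGET) for the whole
            // batch, instead of one remote call per element, then emit the
            // enriched records.
          }
        }));
  }
}
```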

Re: RabbitMQ Message Print

2022-10-26 Thread Alexey Romanenko
Hi Hitesh, Let me ask you some questions on this: 1) Did you try to run this pipeline with the Direct runner locally? 2) Can you make sure that the output messages were successfully written to the same queue that you read from afterwards? 3) Can you make sure that your “processElement()” has been called while

Re: Java + Python Xlang pipeline

2022-10-11 Thread Alexey Romanenko
I’m not sure that I get it correctly. What do you mean by “worker pool” in your case? — Alexey > On 8 Oct 2022, at 03:24, Xiao Ma wrote: > > Hello, > > I would like to run a pipeline with Java as the main language and python > transformation embedded. The beam pipeline is running on the

Re: Cross Language

2022-10-11 Thread Alexey Romanenko
Yes, it’s possible, though the Java IO connector should support being used via X-language. For more details regarding which connectors support this, you may want to take a look at the IO connectors table on this page [1] and check if the required connector is supported "via X-language" for the Python SDK.

Re: [Question] Beam 2.42.0 Release Date Confirmation

2022-10-03 Thread Alexey Romanenko
It’s still in the process of release validation since there are not enough votes for the moment: https://lists.apache.org/thread/grgybp1m1mqx9rdy65czbh0wr1fz0ovg Once the release is voted on and approved, it will be published

Re: [Question] Handling failed records when using JdbcIO

2022-09-19 Thread Alexey Romanenko
Hi, I don’t think it’s possible “out-of-the-box” now, but it could be a useful add-on for the JdbcIO connector (a dead-letter PCollection) since, IIRC, it has already been asked for several times by users. For the moment, it’s only possible to play with RetryStrategy/RetryConfiguration in case of failures. —

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-16 Thread Alexey Romanenko
.java#L937 > > <https://github.com/apache/kafka/blob/1135f22eaf404fdf76489302648199578876c4ac/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L937> > On Tue, Sep 13, 2022 at 12:17 PM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: > Do you h

Re: [troubleshooting] KafkaIO#write gets stuck "since the associated topicId changed from null to "

2022-09-13 Thread Alexey Romanenko
Do you have by any chance the full stacktrace of this error? — Alexey > On 13 Sep 2022, at 18:05, Evan Galpin wrote: > > Ya likewise, I'd expect this to be handled in the Kafka code without the need > for special handling by Beam. I'll reach out to Kafka mailing list as well > and try to

Re: read messages from kakfa: 2 different message types in kafka topic

2022-08-10 Thread Alexey Romanenko
input message or split this topic into two where every topic contains messages with only one schema, or use an Avro union as was suggested above. — Alexey > On 10 Aug 2022, at 15:03, Alexey Romanenko wrote: > > If you have two topics with different schemas in your pipeline then

Re: read messages from kakfa: 2 different message types in kafka topic

2022-08-10 Thread Alexey Romanenko
If you have two topics with different schemas in your pipeline then you need to read them separately with two different KafkaIO instances and configure every instance with a proper deserialiser based on its schema. — Alexey > On 9 Aug 2022, at 22:28, Sigalit Eliazov wrote: > > Thanks for

Re: Possible bug in ConfluentSchemaRegistryDeserializerProvider withe schema evolution

2022-07-29 Thread Alexey Romanenko
Thanks for the question! Interesting... I didn’t go deep into the details yet (I will!) but can it be related to this change? [1][2] [1] https://issues.apache.org/jira/browse/BEAM-10759 [2] https://github.com/apache/beam/pull/12630

Re: Oracle Database Connection Pool creation from Beam

2022-07-29 Thread Alexey Romanenko
In addition to what Moritz said, I just wanted to give an example of using JdbcIO.PoolableDataSourceProvider, taken from the JdbcIO Javadoc: to customize the building of the DataSource, we can provide a SerializableFunction. For example, if you need to provide a PoolingDataSource from an existing
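The pooled-provider usage referenced above might look like this sketch; the Oracle driver class, JDBC URL, credentials, and query are placeholders, not values from the thread:

```java
import org.apache.beam.sdk.io.jdbc.JdbcIO;

public class PooledJdbcRead {
  static JdbcIO.ReadRows buildRead() {
    return JdbcIO.readRows()
        // PoolableDataSourceProvider wraps the configuration in a pooling
        // DataSource that is shared per JVM, instead of opening a fresh
        // connection per bundle.
        .withDataSourceProviderFn(
            JdbcIO.PoolableDataSourceProvider.of(
                JdbcIO.DataSourceConfiguration.create(
                        "oracle.jdbc.OracleDriver",                  // placeholder
                        "jdbc:oracle:thin:@//host:1521/service")     // placeholder
                    .withUsername("user")
                    .withPassword("secret")))
        .withQuery("SELECT * FROM my_table");                        // placeholder
  }
}
```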

Re: RedisIO Apache Beam JAVA Connector

2022-07-20 Thread Alexey Romanenko
our reply & github > issue 21825, it seems SDF is causing some issue in reading from Redis. > > Do you know of any issues with Write? > > If I get a chance to test the reading in my staging environment, I will :) > > Thanks, > Shivam Singhal > > On Mon, 18 Jul

Re: RedisIO Apache Beam JAVA Connector

2022-07-18 Thread Alexey Romanenko
Hi Shivam, RedisIO has been in Beam for quite a long time, so we may consider it rather stable. I guess it was marked @Experimental since its user API was still changing at that point [1]. However, RedisIO recently moved to SDF for the reading part, so I can’t say how it was

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
> On 23 May 2022, at 20:40, Brian Hulette wrote: > > Yeah I'm not sure of any simple way to do this. I wonder if it's worth > considering building some Spark runner-specific feature around this, or at > least packaging up Robert's proposed solution? I’m not sure that a runner specific

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
To add a bit more to what Robert suggested: right, in general we can’t read a Spark RDD directly with Beam (the Spark runner uses RDDs under the hood, but that’s a different story), but you can write the results to any storage, in any data format that Beam supports, and then read it with a corresponding Beam

[PROPOSAL] Stop Spark 2 support in Spark Runner

2022-04-29 Thread Alexey Romanenko
Any objections or comments from Spark 2 users on this topic? — Alexey On 20 Apr 2022, at 19:17, Alexey Romanenko wrote: Hi everyone, A while ago, we already discussed on dev@ that there are several reasons to stop providing support of Spark 2 in the Spark Runner (in all its variants that we

Re: [SURVEY] Deprecation of Beam AWS IOs v1 (Java)

2022-04-26 Thread Alexey Romanenko
+1 to deprecate AWS Java SDK v1 connectors in the next Beam release. — Alexey > On 26 Apr 2022, at 18:51, Moritz Mack wrote: > > Hi Beam AWS user, > > as you might know, there’s currently two different versions of AWS IO > connectors in Beam for the Java SDK: > amazon-web-services [1] and

Re: JdbcIO

2022-04-22 Thread Alexey Romanenko
I don’t think it exists. Do you really need an unbounded pipeline, meaning that the data will continuously arrive, or would re-running a batch pipeline once per some amount of time, or externally triggered by some signal, be enough? — Alexey > On 22 Apr 2022, at 13:40, Eric

Re: [Code Question] Pcollection to List using Java sdk

2022-04-21 Thread Alexey Romanenko
that sends mail, with body as values from > Pcollection, In a tabular format. The number of elements in > Pcollection will be less than 10 always. > > Regards, > Kayal > >> >> On Apr 21, 2022, at 5:13 AM, Alexey Romanenko >> wrote: >> >> Hi K

Re: [Code Question] Pcollection to List using Java sdk

2022-04-21 Thread Alexey Romanenko
Hi Kayal, In general, a PCollection is a potentially infinite collection of elements. So, there is no single simple way to do what you are asking, and the solution will depend on the case where it’s needed. Could you give an example of why and where in your pipeline you need this? — Alexey > On 21 Apr 2022,

Re: [PROPOSAL] Stop Spark2 support in Spark Runner

2022-04-20 Thread Alexey Romanenko
://lists.apache.org/thread/opfhg3xjb9nptv878sygwj9gjx38rmco > On 31 Mar 2022, at 17:51, Alexey Romanenko wrote: > > Hi everyone, > > For the moment, Beam Spark Runner supports two versions of Spark - 2.x and > 3.x. > > Taking into account the several things that: > - almost

Re: KafkaIO consumer rate

2022-04-11 Thread Alexey Romanenko
Hi Sigalit, Could you try to run your test pipeline with the “--experiments=use_deprecated_read” option and see if there is a difference? — Alexey > On 10 Apr 2022, at 21:14, Sigalit Eliazov wrote: > > Hi all > I saw a very low rate when consuming messages from kafka in our different > jobs. >

Re: [Bug]

2022-04-08 Thread Alexey Romanenko
*** I moved it to user@beam.apache.org since it’s rather a user-related question *** Hello Liu, Could you check which versions of Jackson you have in your Spark classpath while running a job? — Alexey > On 8 Apr 2022, at 07:01, Liu Jie wrote: > > Dear Sir/Madam, > > I followed the

Re: Need help with designing a beam pipeline for data enrichment

2022-03-31 Thread Alexey Romanenko
Hi Johannes, Agree, it looks like a general data-processing problem where you need to enrich a small dataset with data from a large one. There can be different solutions depending on your environment and available resources. At first glance, your proposal sounds reasonable to me - so

Re: Support null values in kafkaIO

2022-03-28 Thread Alexey Romanenko
Thank you for working on this, John! This case with null keys/values seems to be in quite high demand. — Alexey > On 28 Mar 2022, at 21:58, John Casey wrote: > > Unfortunately, there isn't a workaround at the moment. I'm nearing completion > of the fix. >

Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

2022-03-28 Thread Alexey Romanenko
y [1] https://github.com/Talend/beam-samples/blob/9288606495b9ba8f77383cd9709ed9b5783deeb8/pom.xml#L66 > > any ideas? > > On 2022/03/28 17:38:13 Alexey Romanenko wrote: > > Well, it’s caused by recent jackson's version update in Beam [1] - so, the > > jackson runtime dependencies should be updated ma

Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

2022-03-28 Thread Alexey Romanenko
Well, it’s caused by the recent Jackson version update in Beam [1] - so, the Jackson runtime dependencies should be updated manually (at least to 2.9.2) in case of using Spark 2.x. Alternatively, use Spark 3.x if possible since it already provides Jackson jars of version 2.10.0. [1]

Re: Write S3 File with CannedACL

2022-03-10 Thread Alexey Romanenko
> contribute from our end? > > Thanks > > On Wed, Mar 9, 2022 at 9:59 AM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: > Hi Yushu, > > I’m not sure that we have a workaround for that since the related jira issue > [1] is still open. > > Si

Re: Write S3 File with CannedACL

2022-03-09 Thread Alexey Romanenko
Hi Yushu, I’m not sure that we have a workaround for that since the related jira issue [1] is still open. Side question: are you interested only in multipart version or both? — Alexey [1] https://issues.apache.org/jira/browse/BEAM-10850 > On 9 Mar 2022, at 00:19, Yushu Yao wrote: > > Hi

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-02 Thread Alexey Romanenko
$sp.java:23) > at > com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Alexey Romanenko
t; beam-sdks-java-io-kafka > 2.35.0 > > > On Tue, Feb 1, 2022 at 9:01 AM Utkarsh Parekh <mailto:utkarsh.s.par...@gmail.com>> wrote: > Here it is > > > org.apache.kafka > kafka-clients > 2.8.0 > > > On Tue, Feb 1, 2022 at 8

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Alexey Romanenko
ot;kafka.sasl.jaas.config", EH_SASL) > .option("kafka.request.timeout.ms <http://kafka.request.timeout.ms/>", > "6") > .option("kafka.session.timeout.ms <http://kafka.session.timeout.ms/>", > "6") >

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Alexey Romanenko
Hi Utkarsh, Can it be related to this configuration problem? https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received Did you check timeout

Re: [Question] Can Apache beam used for chaining IO transforms

2022-01-27 Thread Alexey Romanenko
> Happy to chat about it - I'd guess I'm halfway done but if you've got > something I'd love to chat about it. > > Thanks, > > Wei > > > > > Wei Hsia > Customer Engineer, Analytics Specialist > Google Cloud > weih...@google.com <mailto:weih...@goog

Re: [Question] Can Apache beam used for chaining IO transforms

2022-01-26 Thread Alexey Romanenko
ndow to get > triggered shouldn't we apply some transform like GroupByKey? or else the > window is ignored right? As we dont have such a transform applied, we did not > observe any windowing behavior in the pipeline. > > On Wed, Jan 26, 2022 at 10:04 PM Alexey Romanenko <

Re: [Question] Can Apache beam used for chaining IO transforms

2022-01-26 Thread Alexey Romanenko
that's the use case here. > > Tried the approach that you have suggested, but for the window to get > triggered shouldn't we apply some transform like GroupByKey? or else the > window is ignored right? As we dont have such a transform applied, we did not > observe any w

Re: [Question] Can Apache beam used for chaining IO transforms

2022-01-26 Thread Alexey Romanenko
Hi Yomal de Silva, IIUC, you need to pass records downstream only once they have been stored in the DB with JdbcIO, don’t you? And your pipeline is unbounded. So, why not split your input data PCollection into windows (if that was not done upstream) and use Wait.on() with
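The windowing plus Wait.on() combination suggested above might be sketched as follows. The window size, JDBC configuration, statement, and the downstream `SendDownstreamFn` are all hypothetical placeholders:

```java
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WriteThenForward {
  static void wire(PCollection<String> input) {
    PCollection<String> windowed =
        input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

    // withResults() turns the write into a transform with an output signal
    // that Wait.on() can observe.
    PCollection<Void> written = windowed.apply(
        JdbcIO.<String>write()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                    "org.postgresql.Driver", "jdbc:postgresql://host/db")) // placeholder
            .withStatement("INSERT INTO t (v) VALUES (?)")                 // placeholder
            .withPreparedStatementSetter((element, stmt) -> stmt.setString(1, element))
            .withResults());

    // Each window's elements flow downstream only after that window has been
    // fully written to the database.
    windowed
        .apply(Wait.on(written))
        .apply("Downstream", ParDo.of(new SendDownstreamFn())); // hypothetical DoFn
  }
}
```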

Re: [RFC][design] Standardizing Beam IO connectors

2022-01-25 Thread Alexey Romanenko
Thanks Pablo and others for creating such a doc, great initiative! I left my comments. After all corrections and discussions, I suppose it should be added to the Beam Contribution Guide. — Alexey > On 24 Jan 2022, at 17:00, Pablo Estrada wrote: > > Hi all! > > A few of us have started putting

Re: Testing a jvm pipeline on the portability framework locally

2022-01-18 Thread Alexey Romanenko
Yes, you can do it locally, for example, by running your pipeline via portable Spark or Flink runner. See the details here [1] (select "Portable (Java/Python/Go)” tab for proper code/examples) and here [2]. And the moment of self-advertisement - I gave a talk about cross-language pipeline with

Re: Calcite/BeamSql

2022-01-11 Thread Alexey Romanenko
If I understand your problem right, you can just use JdbcIO.readRows(), which returns a PCollection<Row> and can be used downstream to create a PCollectionTuple, which, in its turn, already contains another PCollection from your Kafka source. So, once you have a PCollectionTuple with two TupleTags
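Joining the two sources as described above might look like this sketch, where the tag names double as SQL table names. The table/tag names, the query, and the assumption that the Kafka records were already converted to schema'd Rows are all illustrative:

```java
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

public class JdbcKafkaJoin {
  static PCollection<Row> join(PCollection<Row> jdbcRows, PCollection<Row> kafkaRows) {
    // Each TupleTag id becomes a table name visible to the SQL query.
    return PCollectionTuple
        .of(new TupleTag<>("users"), jdbcRows)    // from JdbcIO.readRows()
        .and(new TupleTag<>("events"), kafkaRows) // Kafka records mapped to Rows upstream
        .apply(SqlTransform.query(
            "SELECT e.id, u.name FROM events e JOIN users u ON e.id = u.id"));
  }
}
```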

Re: Reading from BigTable with Beam Python SDK

2022-01-07 Thread Alexey Romanenko
Yes but I think it’s caused by the same description for Java and Python SDKs and Java connector supports both - Read and Write. So, we need to reflect this there, thanks for pointing out. — Alexey > On 7 Jan 2022, at 14:50, Pierre Oberholzer > wrote: > > Hi, > > If this discussion

Re: Compatibility of Spark portable runner

2022-01-05 Thread Alexey Romanenko
Generally speaking, to avoid potential issues, the versions used at compile time and at runtime should be the same (most importantly the Scala version), but, due to Spark backward compatibility, the minor versions can differ. > On 5 Jan 2022, at 07:50, Zheng Ni wrote: > > Hi

Re: Kafka manually commit offsets

2021-12-10 Thread Alexey Romanenko
I answered a similar question on SO a while ago [1], and I hope it helps. “By default, pipeline.apply(KafkaIO.read()...) will return a PCollection<KafkaRecord<K, V>>. So, downstream in your pipeline you can get an offset from the KafkaRecord metadata and commit it manually in a way that you need (just don't
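Accessing that per-record metadata downstream can be sketched as below; the broker address, topic, and the actual commit mechanism (your own consumer or admin client) are assumptions outside this snippet:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualOffsetAccess {
  static PCollection<String> wire(Pipeline p) {
    // Without withoutMetadata(), the read yields full KafkaRecords.
    PCollection<KafkaRecord<String, String>> records = p.apply(
        KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092") // placeholder
            .withTopic("my-topic")              // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class));

    return records.apply(ParDo.of(new DoFn<KafkaRecord<String, String>, String>() {
      @ProcessElement
      public void process(@Element KafkaRecord<String, String> rec,
                          OutputReceiver<String> out) {
        String topic = rec.getTopic();
        int partition = rec.getPartition();
        long offset = rec.getOffset();
        // Commit (topic, partition, offset + 1) yourself here, e.g. with a
        // separately managed KafkaConsumer, once processing has succeeded.
        out.output(rec.getKV().getValue());
      }
    }));
  }
}
```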

Re: [Question]

2021-10-28 Thread Alexey Romanenko
Hi, You just need to create a shaded jar for the SparkRunner and submit it with “spark-submit” and the CLI options “--deploy-mode cluster --master yarn”. Also, you need to specify “--runner=SparkRunner --sparkMaster=yarn” as pipeline options. — Alexey > On 26 Oct 2021, at 19:07, Holt Spalding

Re: Performance of Apache Beam

2021-10-19 Thread Alexey Romanenko
+ Azhar (just in case) > On 18 Oct 2021, at 11:30, Jan Lukavský wrote: > > Hi Azhar, > > -dev +user > this kind of question cannot be answered in general. The overhead will depend > on the job and the SDK you use. Using Java SDK with

Re: Perf issue with Beam on spark (spark runner)

2021-10-12 Thread Alexey Romanenko
Date: Tuesday, September 7, 2021 at 2:15 PM > To: "user@beam.apache.org <mailto:user@beam.apache.org>" > mailto:user@beam.apache.org>> > Cc: Alexey Romanenko <mailto:aromanenko@gmail.com>>, Andrew Pilloud <mailto:apill...@google.com>>, Ismaë

Re: Apache Beam 2.31 over Flink 1.13 with Java 11 throws RuntimeException

2021-09-28 Thread Alexey Romanenko
CC: dev@ > On 23 Sep 2021, at 15:48, Ohad Pinchevsky wrote: > According to the blog in Beam website Java 11 is already supported since 2.27 > https://beam.apache.org/blog/beam-2.27.0/ > I believe it’s related only for Java SDK containers whereas

Re: KinesisIO - support for enhanced fan-out

2021-08-30 Thread Alexey Romanenko
There is already an open Jira for this feature [1]. Pavel, feel free to add your thoughts on this there or on dev@. I left my comment on related PR [2] and I’d be happy to help with moving this feature forward. It would be great to re-use the work that was already done. — Alexey [1]

Re: Perf issue with Beam on spark (spark runner)

2021-08-06 Thread Alexey Romanenko
plittable. Thanks! > > From: Alexey Romanenko > Date: Thursday, August 5, 2021 at 6:40 AM > To: Tao Li > Cc: "user@beam.apache.org" , Andrew Pilloud > , Ismaël Mejía , Kyle Weaver > , Yuchu Cao > Subject: Re: Perf issue with Beam on spark (spark runner) >

Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Alexey Romanenko
w?usp=sharing > On 5 Aug 2021, at 03:07, Tao Li wrote: > > @Alexey Romanenko <mailto:aromanenko@gmail.com> @Ismaël Mejía > <mailto:ieme...@gmail.com> I assume you are experts on spark runner. Can you > please take a look at this thread and confirm this

Re: Beam Calcite SQL SparkRunner Performance

2021-07-06 Thread Alexey Romanenko
1, at 04:39, Tao Li wrote: > > @Alexey Romanenko <mailto:aromanenko@gmail.com> do you have any thoughts > on this issue? Looks like the dag compiled by “Beam on Spark” has many more > stages than native spark, which results in more shuffling and thus longer > processin

Re: SparkRunner

2021-07-01 Thread Alexey Romanenko
Hi Trevor, a Beam portable Spark pipeline is, in the end, just a Spark pipeline which you run on a Spark cluster. So, all resources are managed by the processing engine (Spark in your case) and the cluster configuration; Beam doesn’t handle errors on this level. So, in your place, I’d investigate this issue

Re: How to specify a spark config with Beam spark runner

2021-06-10 Thread Alexey Romanenko
Hi Tao, "Limited spark options”, that you mentioned, are Beam's application arguments and if you run your job via "spark-submit" you should still be able to configure Spark application via normal spark-submit “--conf key=value” CLI option. Doesn’t it work for you? — Alexey > On 10 Jun 2021,

Re: Oracle JDBC driver with expansion service

2021-05-25 Thread Alexey Romanenko
Hi, This question looks more like a user-related question, so let's continue this conversation on user@. — Alexey > On 25 May 2021, at 15:32, Rafael Ribeiro wrote: > > Hi, > > I'm trying to read and write on an Oracle database using the JDBC driver of Beam > > but I'm having some problems,

Re: JdbcIO parallel read on spark

2021-05-25 Thread Alexey Romanenko
Hi, Did you check the Spark DAG to see whether it forks branches after the "Generate queries" transform? — Alexey > On 24 May 2021, at 20:32, Thomas Fredriksen(External) > wrote: > > Hi there, > > We are struggling to get the JdbcIO-connector to read a large table on spark. > > In short - we wish
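The usual pattern behind a "Generate queries" step is to split one large table scan into ranged queries, so each range can be read in parallel downstream (e.g. by JdbcIO.readAll in a real Beam pipeline). A pure-Python sketch of just the query-generation step; the table and column names are hypothetical:

```python
# Sketch: split a keyed table scan into ranged SQL queries so each range
# can be read by a separate worker. Table/column names are made up.
def generate_queries(min_id, max_id, num_ranges):
    """Yield one BETWEEN query per id range covering [min_id, max_id]."""
    span = max_id - min_id + 1
    step = -(-span // num_ranges)  # ceiling division
    for lo in range(min_id, max_id + 1, step):
        hi = min(lo + step - 1, max_id)
        yield f"SELECT * FROM orders WHERE id BETWEEN {lo} AND {hi}"

queries = list(generate_queries(1, 100, 4))
print(queries[0])  # SELECT * FROM orders WHERE id BETWEEN 1 AND 25
```

If the resulting Spark DAG shows no fan-out after this step, the ranges are likely being read serially, which is the performance symptom the thread describes.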

Re: File processing triggered from external source

2021-05-25 Thread Alexey Romanenko
You don’t need to use a windowing strategy or aggregation triggers for a pipeline with a bounded source to perform GbK-like transforms, but since you started to use an unbounded source, your PCollections became unbounded and you do need to do that. Otherwise, it’s unknown at which point of time your
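The point about unbounded sources can be illustrated with the fixed-window arithmetic Beam applies: each element's timestamp maps it into a window, and GroupByKey can then fire per window instead of waiting for an end of input that never comes. A small sketch mirroring what FixedWindows does (the 60-second size is an arbitrary choice, not from the thread):

```python
# Sketch: how fixed windowing assigns an element timestamp to a window,
# so grouping can complete per window on an unbounded stream.
WINDOW_SIZE = 60  # seconds; arbitrary illustrative choice

def window_for(timestamp):
    """Return the [start, end) fixed window containing the timestamp."""
    start = timestamp - (timestamp % WINDOW_SIZE)
    return (start, start + WINDOW_SIZE)

print(window_for(125))  # (120, 180)
```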

Re: KafkaIO with DirectRunner is creating tons of connections to Kafka Brokers

2021-05-25 Thread Alexey Romanenko
> On 24 May 2021, at 10:43, Sozonoff Serge wrote: > > OK thanks. Just to clarify, in my case the message throughput is zero when I > start the Beam pipeline up and it will still crash once all file handles are > consumed even if I don't send a single message to the kafka topic. This sounds

Re: Exporting beam custom metrics to Prometheus

2021-04-28 Thread Alexey Romanenko
Hi, Could it be related [1][2]? [1] https://issues.apache.org/jira/browse/BEAM-7438 [2] https://issues.apache.org/jira/browse/BEAM-10928 — Alexey > On 28 Apr 2021, at 08:45, Feba Fathima wrote: > > Hi, > >I have a java beam pipeline which is run using a flink runner. I tried > adding

Re: Writing to multiple S3 buckets in multiple regions

2021-04-28 Thread Alexey Romanenko
Hi Valeri, For now, it’s not possible to write to different AWS regions from the same write transform instance. There is an open Jira about this [1]. As a workaround (maybe not very effective, I didn’t try it), I guess you can branch your input PCollection into several branches, depending
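A minimal sketch of the branching workaround: partition the input by target region, so each branch could then be handed to its own region-configured S3 write transform. The region names and the routing rule are assumptions, since the snippet leaves those details open:

```python
# Sketch: split records by destination region; in a real pipeline each
# branch would feed its own region-scoped write. Region names and the
# routing rule ("region" field on each record) are hypothetical.
from collections import defaultdict

def partition_by_region(records):
    branches = defaultdict(list)
    for record in records:
        branches[record["region"]].append(record)
    return branches

records = [
    {"region": "us-east-1", "key": "a"},
    {"region": "eu-west-1", "key": "b"},
]
branches = partition_by_region(records)
print(sorted(branches))  # ['eu-west-1', 'us-east-1']
```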

Re: Avoiding OutOfMemoryError for large batch-jobs

2021-04-27 Thread Alexey Romanenko
at timeout? Well, I think it can, but I don’t see how it could lead to OOM in this case. Did you see anything about this in the logs? > > On Mon, Apr 26, 2021 at 1:48 PM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: > > >> On 26 Apr 2021, at 13:34, Thomas F

Re: Avoiding OutOfMemoryError for large batch-jobs

2021-04-26 Thread Alexey Romanenko
nners.spark.translation.EvaluationContext.computeOutputs(EvaluationContext.java:228) > at > org.apache.beam.runners.spark.SparkRunner.lambda$run$1(SparkRunner.java:241) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(Fut

Re: Avoiding OutOfMemoryError for large batch-jobs

2021-04-26 Thread Alexey Romanenko
Hi Thomas, Could you share the stack trace of your OOM and, if possible, a code snippet of your pipeline? AFAIK, usually only "large" GroupByKey transforms, caused by "hot keys", may lead to OOM with SparkRunner. — Alexey > On 26 Apr 2021, at 08:23, Thomas Fredriksen(External) > wrote:
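A quick way to check for the "hot keys" the reply mentions is to count elements per key before the GroupByKey; a key far above the average count is the usual OOM suspect. A plain-Python sketch (the sample data and the 2x-mean threshold are made up for illustration):

```python
# Sketch: flag "hot keys" that can blow up a GroupByKey by counting
# elements per key and reporting keys far above the mean count.
# Sample data and threshold factor are illustrative only.
from collections import Counter

def hot_keys(pairs, factor=2):
    counts = Counter(k for k, _ in pairs)
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return [k for k, c in counts.items() if c > factor * mean]

pairs = [("hot", i) for i in range(1000)] + [("a", 1), ("b", 2)]
print(hot_keys(pairs))  # ['hot']
```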

Re: JdbcIO SQL best practice

2021-04-15 Thread Alexey Romanenko
wrote: > > This seems very promising, > > Will the write from PCollection handle upserts? > > On Wed, Mar 24, 2021 at 6:56 PM Alexey Romanenko <mailto:aromanenko@gmail.com>> wrote: > Thanks for the details. > > If I’m not mistaken, JdbcIO already supports b

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Alexey Romanenko
r. I don't know much about KinesisIO or how Kinesis works. I was just > asking to clarify, in case other folks know more, like +Alexey Romanenko > <mailto:aromanenko@gmail.com> and +Ismaël Mejía > <mailto:ieme...@gmail.com> have modified KinesisIO. If the feature doe

Re: Write to multiple IOs in linear fashion

2021-03-25 Thread Alexey Romanenko
something. > > > > Returning PDone is an anti-pattern that should be avoided, but changing it > > now would be backwards incompatible. > > Periodic reminder most IOs are still Experimental so I suppose it is > worth to the maintainers to judge if the upgrade to return
