Re: Expansion service for SqlTransform fails with a local Flink cluster using Python SDK

2024-03-13 Thread Chamikara Jayalath via user
> When I check the expansion service docker container, normally it downloads a JAR file and starts SDK Fn Harness. To clarify the terminology here, I think you meant the Java SDK harness container, not the expansion service. The expansion service is only needed during job submission and your failure is

Re: [Question] ReadFromKafka can't get messages.

2024-03-08 Thread Chamikara Jayalath via user
Which runner are you using? There's a known issue with SDFs not triggering for portable runners: https://github.com/apache/beam/issues/20979 This should not occur for Dataflow. For Flink, you could use the option "--experiments=use_deprecated_read" to make it work. Thanks, Cham On Fri, Mar 8,
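A minimal sketch of what that looks like from the Python SDK (broker address and topic are illustrative, not from this thread):

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    # The experiment works around https://github.com/apache/beam/issues/20979
    # on portable runners such as Flink.
    options = PipelineOptions([
        '--runner=FlinkRunner',
        '--experiments=use_deprecated_read',
    ])
    with beam.Pipeline(options=options) as p:
        (p
         | ReadFromKafka(
             consumer_config={'bootstrap.servers': 'broker:9092'},
             topics=['my_topic'])
         | beam.Map(print))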

Re: Downloading and executing addition jar file when using Python API

2024-01-23 Thread Chamikara Jayalath via user
The expansion service jar is needed since sql.py includes cross-language transforms that use the Java implementation under the hood. Once downloaded, the jar is cached, and subsequent jobs should use the jar from that location. If you want to use a locally available jar, you can manually
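For illustration, a sketch of passing a manually started, local expansion service to SqlTransform (the jar name, path, and port below are assumptions):

    # Start the service first, e.g.:
    #   java -jar beam-sdks-java-extensions-sql-expansion-service-<version>.jar 8097
    import apache_beam as beam
    from apache_beam.transforms.sql import SqlTransform

    with beam.Pipeline() as p:
        (p
         | beam.Create([beam.Row(id=1, name='a')])
         | SqlTransform(
             'SELECT id, name FROM PCOLLECTION',
             expansion_service='localhost:8097')
         | beam.Map(print))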

Re: [Question] - How can i listen to multiple Pub/Sub input topics using SideInput?

2023-11-07 Thread Chamikara Jayalath via user
would also suggest rethinking your infrastructure setup. >> >> Best >> Wiśniowski Piotr >> >> Wed, 1 Nov 2023, 19:06 user Chamikara Jayalath via user < >> user@beam.apache.org> wrote: >> >>> Currently only some Beam sources are able to co

Re: [Question] - How can i listen to multiple Pub/Sub input topics using SideInput?

2023-11-01 Thread Chamikara Jayalath via user
Currently only some Beam sources are able to consume a configuration (a set of topics here) that is dynamically generated, and I don't think PubSubIO is one of them. So probably you'll have to implement a custom DoFn that reads from Cloud Pub/Sub to support this. Also, probably you'll have to
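A rough sketch of such a DoFn using the google-cloud-pubsub client directly, assuming subscription paths arrive as elements; note this naive version has none of the watermark or checkpoint handling a real streaming source needs:

    import apache_beam as beam

    class PullFromSubscription(beam.DoFn):
        def setup(self):
            # Create the client once per worker rather than per element.
            from google.cloud import pubsub_v1
            self.subscriber = pubsub_v1.SubscriberClient()

        def process(self, subscription_path):
            # Pull a bounded batch from the subscription named by the element.
            response = self.subscriber.pull(
                subscription=subscription_path, max_messages=100)
            for received in response.received_messages:
                yield received.message.data
            if response.received_messages:
                self.subscriber.acknowledge(
                    subscription=subscription_path,
                    ack_ids=[m.ack_id for m in response.received_messages])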

Re: [Request for Feedback] Swift SDK Prototype

2023-09-20 Thread Chamikara Jayalath via user
e build-time integrations with multi-language as well (which >> is possible in Swift through compiler plugins) in the same way as a >> pipeline author would. You also maybe get backwards compatibility testing >> as a side effect in that case as well.

Re: [Request for Feedback] Swift SDK Prototype

2023-09-20 Thread Chamikara Jayalath via user
s it into a >>>>> struct with friendlier names. Not strictly necessary, but makes the code >>>>> nicer to read I think. POutput introduces emit functions that optionally >>>>> allow you to specify a timestamp and a window. If you don't for

Re: [Question] Multiple images per pipeline

2023-09-05 Thread Chamikara Jayalath via user
It's possible in theory, but currently we don't have a good API for replacing the environment of a given transform when defining a pipeline. Environments are configured during transform expansion, and if transforms use expansion services (with different dependencies), they will get unique

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Chamikara Jayalath via user
In any case, I'm updating the branch as I find a minute here and there. >>>>> Best, >>>>> B

Re: [Request for Feedback] Swift SDK Prototype

2023-08-17 Thread Chamikara Jayalath via user
Thanks Byron. This sounds great. I wonder if there is interest in a Swift SDK from folks currently subscribed to the +user list. On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev wrote: > Hello everyone, > > A couple of months ago I decided that I wanted to really understand how > the Beam

[ANNOUNCE] Transform Service

2023-08-10 Thread Chamikara Jayalath via user
Hi All, We recently added a Docker Compose-based service named Transform Service to Beam. The Transform Service includes a number of transforms released with Beam and provides a single endpoint for accessing them via Beam's multi-language pipelines framework. I've updated the Beam Java/Python SDKs

Re: Create IO connector for HTTP or ParDO

2023-06-23 Thread Chamikara Jayalath via user
Connectors are written using ParDos. A connector (source) may use a source framework (Splittable DoFn is the recommended framework currently) or may be written using regular ParDos. The main advantages of a source framework are various features provided by such frameworks (progress reporting,
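For a concrete picture of the recommended framework, here is a minimal splittable DoFn in the Python SDK that emits the numbers 0..element; a sketch of the restriction/tracker machinery, not a full source:

    import apache_beam as beam
    from apache_beam.io.restriction_trackers import (
        OffsetRange, OffsetRestrictionTracker)
    from apache_beam.transforms.core import RestrictionProvider

    class CountProvider(RestrictionProvider):
        def initial_restriction(self, element):
            return OffsetRange(0, element)

        def create_tracker(self, restriction):
            return OffsetRestrictionTracker(restriction)

        def restriction_size(self, element, restriction):
            return restriction.size()

    class EmitNumbers(beam.DoFn):
        def process(
            self, element,
            tracker=beam.DoFn.RestrictionParam(CountProvider())):
            # The runner may split the restriction, so claim each position
            # before emitting; this is what enables progress reporting.
            position = tracker.current_restriction().start
            while tracker.try_claim(position):
                yield position
                position += 1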

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Chamikara Jayalath via user
Another advantage of a portable runner would be that it will be using well-defined and backwards-compatible Beam portable APIs to communicate with SDKs. I think this is especially important for runners that do not live in the Beam repo, since otherwise future SDK releases could break your runner in

Re: BatchUpdateException while trying to use WriteToJdbc

2023-02-21 Thread Chamikara Jayalath via user
a bigger source data payload. > Does it look like multiple threads trying to acquire a write lock to the > DB table (Oracle table)? > > Thanks and Regards, > > Varun Rauthan > > > > On Wed, Feb 22, 2023 at 1:23 AM Chamikara Jayalath via user < >

Re: BatchUpdateException while trying to use WriteToJdbc

2023-02-21 Thread Chamikara Jayalath via user
Which runner are you using? Also, do you have the bottom of the stack trace here? It's possibly due to Docker containers running the Java SDK not having access to your database, but I'm not sure based on the information provided. Thanks, Cham On Tue, Feb 21, 2023 at 11:32 AM Somnath Chouwdhury

Re: ElasticsearchIO write to trigger a percolator

2023-02-02 Thread Chamikara Jayalath via user
On Thu, Feb 2, 2023 at 1:56 PM Kaggal, Vinod C. (Vinod C.), M.S. via user < user@beam.apache.org> wrote: > Hello! Thank you for all the hard work on implementing these useful > libraries. > > > > *Background:* We have been using Apache Storm in production for some time > (over 8 years) and have

[ANNOUNCE] Apache Beam 2.43.0 Released

2022-11-18 Thread Chamikara Jayalath via user
The Apache Beam team is pleased to announce the release of version 2.43.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release

Fwd: Cross Language

2022-10-12 Thread Chamikara Jayalath via user
+user -- Forwarded message - From: Chamikara Jayalath Date: Wed, Oct 12, 2022 at 9:38 AM Subject: Re: Cross Language To: phani geeth On Wed, Oct 12, 2022 at 12:25 AM phani geeth wrote: > Hi Chamikara, > > Thanks for replying. > > My use case is to write

Re: Cross Language

2022-10-11 Thread Chamikara Jayalath via user
Is there a specific I/O connector you are hoping to use? Thanks, Cham On Tue, Oct 11, 2022 at 4:31 AM Alexey Romanenko wrote: > Yes, it’s possible though the Java IO connector should support being used via > X-language. > > For more details regarding which connector supports this, you may want to

Re: [question] Good Course to learn beam

2022-08-31 Thread Chamikara Jayalath via user
Not necessarily a course but I would also highly recommend reading the Beam Programming Guide to learn about Beam. https://beam.apache.org/documentation/programming-guide/ I think it's well written and it comes with example code for all supported SDKs. Thanks, Cham On Wed, Aug 31, 2022 at 12:31

Re: PubSub Lite IO & Python?

2022-08-25 Thread Chamikara Jayalath via user
e worker CPU utilization, and the > pipeline having sufficiently low backlog and keeping up with input rate.", > which seems to me like it's just not really reading things correctly > > On Thu, Aug 25, 2022 at 8:44 PM Chamikara Jayalath > wrote: > >> Do you see any errors in Dataflo

Re: PubSub Lite IO & Python?

2022-08-25 Thread Chamikara Jayalath via user
>>> table_config=table_config >>>> ) >>>> >>> >>> On Fri, Aug 5, 2022 at 1:11 AM Austin Bennett < >>> whatwouldausti...@gmail.com> wrote: >>> >>>> @cham thanks for bringing the conversation back to th

Re: How to register as external cross language transform ?

2022-08-17 Thread Chamikara Jayalath via user
On Wed, Aug 17, 2022 at 3:05 AM Yu Watanabe wrote: > Hello. > > I am trying to write code for a cross-language transform for > ElasticsearchIO but am having trouble with it. > I would appreciate it if I could get help. > > As described in the doc, and also referencing KafkaIO,

Re: PubSub Lite IO & Python?

2022-08-04 Thread Chamikara Jayalath via user
ned that there just wasn't anything to access. We could definitely >> have been wrong about that but it wasn't clear how to move forward so we >> just switched our focus to writing native Spark code to pull from PubSub >> Lite >> >> On Thu, Aug 4, 2022 at 6:46 PM Chamika

Re: PubSub Lite IO & Python?

2022-08-04 Thread Chamikara Jayalath via user
I believe this should be fully working. I'm not familiar with PyBeam though. Is the execution mechanism the same as running a regular Beam pipeline? Also, note that for multi-language, you need to use a portable Beam runner. +Daniel Collins who implemented this. Thanks, Cham On Thu, Aug 4,

Re: How to configure external service for Kafka IO to run the flink job in k8s

2022-07-21 Thread Chamikara Jayalath via user
supports the environment type/config you specify in the expansion service. Thanks, Cham On Mon, Jul 18, 2022 at 2:54 PM Ahmet Altay wrote: > Adding a few relevant folks who could help answer this question: @John > Casey @Chamikara Jayalath > @Robert > Bradshaw > > Lydia

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

2022-07-20 Thread Chamikara Jayalath via user
On Wed, Jul 20, 2022 at 12:57 PM Chamikara Jayalath wrote: > I don't think it's an antipattern per se. You can implement arbitrary > operations in a DoFn or an SDF to read data. > > But if a single resource ID maps to a large amount of data, Beam runners > (including Dataflo

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

2022-07-20 Thread Chamikara Jayalath via user
I don't think it's an antipattern per se. You can implement arbitrary operations in a DoFn or an SDF to read data. But if a single resource ID maps to a large amount of data, Beam runners (including Dataflow) will not be able to parallelize reading, hence your solution may have suboptimal performance
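A sketch of the plain-DoFn approach described above (the endpoint is hypothetical); per the caveat, a single element that maps to a huge payload will still be fetched by one worker:

    import apache_beam as beam

    class FetchResource(beam.DoFn):
        def setup(self):
            # One HTTP session per worker.
            import requests
            self.session = requests.Session()

        def process(self, resource_id):
            # Hypothetical API; replace with the real endpoint.
            response = self.session.get(
                'https://api.example.com/resources/%s' % resource_id,
                timeout=30)
            response.raise_for_status()
            yield response.json()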

Re: Implementing a custom I/O Connector

2022-07-14 Thread Chamikara Jayalath via user
> > NB > - I did some editing on the notebook so the original revision is here > <https://colab.research.google.com/drive/1ljtoEtyG0gwbq6SPTY1EHHmpdh6EHuxu#scrollTo=vaXnHuVOtEdG> > > On Thu, Jul 14, 2022 at 10:15 PM Chamikara Jayalath via user < > user@beam.apache.org>

Re: Implementing a custom I/O Connector

2022-07-14 Thread Chamikara Jayalath via user
Do you have the full stack trace? Also, what does the Read() transform in the example entail? Thanks, Cham On Thu, Jul 14, 2022 at 7:39 AM Damian Akpan wrote: > Hi Everyone, > > I've been working on implementing a Google Sheets IO source for my > pipeline. I've tried this example >

Re: Any guideline for building golang connector ?

2022-07-11 Thread Chamikara Jayalath via user
Strong +1 for using x-lang instead of re-implementing the ElasticSearch connector in Go. Thanks, Cham On Fri, Jul 8, 2022 at 5:16 AM Yu Watanabe wrote: > Hello Danny. > > Thank you for the details. I appreciate your message. > > I am a newbie around building IO. So I will look into the links

Re: Debugging External Transforms on Dataflow (Python)

2021-06-15 Thread Chamikara Jayalath
On Tue, Jun 15, 2021 at 3:20 AM Alex Koay wrote: > Several questions: > > 1. Is there any way to set the log level for the Java workers via a Python > Dataflow pipeline? > > 2. What is the easiest way to debug an external transform in Java? My main > pipeline code is in Python. > In general,

Re: Runner support for cross-language transforms, Flink, Dataflow

2021-06-11 Thread Chamikara Jayalath
On Fri, Jun 11, 2021 at 9:49 AM Reynaldo Baquerizo < reynaldo.michel...@bairesdev.com> wrote: > > Hi, > > What’s the state of runner support for cross-language transforms? > Is this roadmap up to date > https://beam.apache.org/roadmap/connectors-multi-sdk ? > > I’m looking for documentation on

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Chamikara Jayalath
>> >> As an aside, I put together an ExternalTransform for MqttIO which you can >> find here: >> https://gist.github.com/alexkoay/df35eb12bc2afd8f502ef13bc915b33c >> I can confirm that it works in batch mode, but given that I couldn't get >> Kafka to work with Dataflow, I don

Re: Issues running Kafka streaming pipeline in Python

2021-06-02 Thread Chamikara Jayalath
wrote: > /cc @Boyuan Zhang for Kafka, @Chamikara Jayalath > for multi-language, who might be able to help. > > On Tue, Jun 1, 2021 at 9:39 PM Alex Koay wrote: > >> Hi all, >> >> I have created a simple snippet as such: >> >> import apache_beam as bea

Re: retaining filename during file processing

2021-05-20 Thread Chamikara Jayalath
On Wed, May 19, 2021 at 6:28 PM wrote: > I'm writing an app that processes an unbounded stream of filenames and then > catalogs them. What I'd like to do is to parse the files using AvroIO, but > have each record entry paired with the original filename as a key. > > In the past I've used the combo

Re: Beam 2.29.0 throwsing warning/error when reading by using BigQuery Storage Read API

2021-05-18 Thread Chamikara Jayalath
Filip, could you create a Jira with steps to reproduce? +Kenneth Jung Thanks, Cham On Mon, May 17, 2021 at 10:29 AM Filip Popić wrote: > Hi, > > I am trying to read data from a BigQuery table using the BigQuery Storage/Direct > Read API

Re: Potential bug with Kafka (or other external) IO in Python Portable Runner

2021-05-06 Thread Chamikara Jayalath
> Nir > > On Wed, May 5, 2021 at 12:07 AM Chamikara Jayalath > wrote: > >> When you use cross-language Java transforms from Python we use the >> default environment for Java transforms which always gets set to Docker. >> >> https://github.com/apache/beam/bl

Re: IllegalStateException with simple Kafka Pipeline

2021-05-06 Thread Chamikara Jayalath
On Thu, May 6, 2021 at 9:53 AM Nir Gazit wrote: > Hey, > I'm trying to run a pipeline with the Python SDK that reads from Kafka. > I've started with a simple one that just reads messages and prints them to > the console. When running on Flink, I get the following error: > File "kafka_print.py",

Re: Potential bug with Kafka (or other external) IO in Python Portable Runner

2021-05-04 Thread Chamikara Jayalath
am/runners/fnexecution/control/DefaultJobBundleFactory.java#L252>) > if it got that in PipelineOptions? > > Otherwise, how can I have docker on a k8s pod? I couldn't seem to find any > examples for that and saw that it's highly unrecommended. > > Thanks! > Nir > > On Tue, May 4, 2021

Re: Potential bug with Kafka (or other external) IO in Python Portable Runner

2021-05-04 Thread Chamikara Jayalath
docker on the pods, so I don’t want to use the docker environment. > That’s why I specified the EXTERNAL environment in PipelineOptions. However, it > seems that it doesn’t get propagated. > > On Tue, 4 May 2021 at 20:59 Chamikara Jayalath > wrote: > >> Is it possible that you d

Re: Potential bug with Kafka (or other external) IO in Python Portable Runner

2021-05-04 Thread Chamikara Jayalath
Is it possible that you don't have the "docker" command available on your system? On Tue, May 4, 2021 at 10:28 AM Nir Gazit wrote: > Hey, > I'm trying to run a pipeline with a Kafka source, using an EXTERNAL > environment. However, when the pipeline is run, the error below is thrown, > which
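For context, a sketch of the options that point a portable runner at an already-running SDK worker instead of Docker (endpoints are illustrative; as this thread discusses, cross-language Java transforms may still default to a Docker environment):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',         # Flink job service
        '--environment_type=EXTERNAL',
        '--environment_config=localhost:50000',  # externally started SDK worker
    ])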

Re: Dataflow - Kafka error

2021-04-02 Thread Chamikara Jayalath
Great :) On Fri, Apr 2, 2021 at 3:01 AM Maria-Irina Sandu wrote: > I was able to submit the job after specifying the runner. Thanks a lot! > >>

Re: Dataflow - Kafka error

2021-03-31 Thread Chamikara Jayalath
>> 'alt-svc': 'h3-29=":443"; ma=2592000,h3-T051=":443"; >> ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; >> ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"', >> 'tran

Re: Dataflow - Kafka error

2021-03-30 Thread Chamikara Jayalath
-tabpanel#comment-17305920 On Tue, Mar 30, 2021 at 1:23 PM Brian Hulette wrote: > +Chamikara Jayalath > > Could you try with Beam 2.27.0 or 2.28.0? I think that this PR [1] may > have addressed the issue. It avoids the problematic code when the pipeline > is multi-language [2]

Re: [Question] -- Getting error while writing data into Big Query from Dataflow -- "Clients have non-trivial state that is local and unpickleable.", _pickle.PicklingError: Pickling client objects is e

2021-03-29 Thread Chamikara Jayalath
This might be relevant: https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors If "save_main_session" is set, Dataflow tries to pickle the main session. So you might have to define such objects locally (for example, within functions, DoFn classes, etc.) or update the
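A sketch of the suggested pattern: construct the unpicklable client inside DoFn.setup() so it is built per worker and never pickled with the main session (the table name is illustrative):

    import apache_beam as beam

    class WriteRows(beam.DoFn):
        def setup(self):
            # Constructed on the worker, so save_main_session never sees it.
            from google.cloud import bigquery
            self.client = bigquery.Client()

        def process(self, row):
            self.client.insert_rows_json('project.dataset.table', [row])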

Re: read from kafka: cannot encode a null byte[]

2021-03-26 Thread Chamikara Jayalath
On Fri, Mar 26, 2021 at 12:11 AM yilun zhang wrote: > Hey, > > We are using Docker mode to run Kafka with the Beam Python SDK and encounter > an error: > > Caused by: org.apache.beam.sdk.coders.CoderException: cannot encode a null > byte[] > at

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Chamikara Jayalath
all sinks. Any contributions related to this are welcome. Thanks, Cham > >> >> On Wed, Mar 24, 2021 at 10:00 AM Chamikara Jayalath >> wrote: >> >>> If you want to wait until all records are written (per window) to >>> Cassandra before writing that windo

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Chamikara Jayalath
If you want to wait until all records are written (per window) to Cassandra before writing that window to PubSub, you should be able to use the Wait transform: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Wait.java Thanks, Cham On Wed, Mar

Re: Apache Beam Python SDK ReadFromKafka does not receive data

2021-03-17 Thread Chamikara Jayalath
question, did you launch your pipeline with streaming=True >>> pipeline >>> options? I think you need to use --streaming=True to have unbounded >>> source working properly. >>> >>> On Tue, Mar 16, 2021 at 9:41 AM Boyuan Zhang wrote: >>> >>>

Re: Apache Beam Python SDK ReadFromKafka does not receive data

2021-03-16 Thread Chamikara Jayalath
behaviour with DirectRunner as well, BTW. > > ..Sumeet > > On Tue, Mar 16, 2021 at 12:00 PM Chamikara Jayalath > wrote: > >> I'm not too familiar with Flink, but it seems like, for streaming >> pipelines, messages from a Kafka/SDF read do not get pushed to subsequent >> steps

Re: Apache Beam Python SDK ReadFromKafka does not receive data

2021-03-16 Thread Chamikara Jayalath
.withBootstrapServers(options.getBootstrap()) >> .withTopic(options.getOutputTopic()) >> >> .withKeySerializer(org.apache.kafka.common.serialization.StringSerializer.class) >> >> .withValueSerializer(org.apache.kafka.common.serialization.Str

Re: Apache Beam Python SDK ReadFromKafka does not receive data

2021-03-09 Thread Chamikara Jayalath
Also, can you try sending messages back to Kafka (or another distributed system like GCS) instead of just printing them? (Given that multi-language pipelines run SDK containers in Docker, you might not see prints in the original console, I think.) Thanks, Cham On Tue, Mar 9, 2021 at 10:26 AM

[ANNOUNCE] Beam 2.28.0 Released

2021-02-22 Thread Chamikara Jayalath
The Apache Beam team is pleased to announce the release of version 2.28.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release

Re: Regarding Beam 2.28 release timeline

2021-02-17 Thread Chamikara Jayalath
Currently the first release candidate is being validated. Please see here for updates: https://lists.apache.org/thread.html/r247d51726a8f330400468db0f69b3531e47525952fb6af5d8614%40%3Cdev.beam.apache.org%3E Thanks, Cham On Wed, Feb 17, 2021 at 2:02 PM Tao Li wrote: > Hi Beam community, > >

Re: Publish on Pubsub with ordering keys

2021-02-09 Thread Chamikara Jayalath
Beam currently does not support this feature. Please see here for some context (with regard to Beam+Dataflow): https://medium.com/google-cloud/google-cloud-pub-sub-ordered-delivery-1e4181f60bc8 Thanks, Chamikara On Mon, Feb 8, 2021 at 5:13 PM Hemali Sutaria wrote: > > >

Re: Using the Beam Python SDK and PortableRunner with Flink to connect to Kafka with SSL

2021-02-02 Thread Chamikara Jayalath
Thanks again. >>> >>> On Tue, Feb 2, 2021 at 9:08 PM Kyle Weaver wrote: >>> >>>> AFAIK sdk_harness_container_image_overrides only works for the Dataflow >>>> runner. For other runners I think you will have to change the default >>>> env

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Chamikara Jayalath
Sounds like a bug. I think a JIRA with a test case will still be helpful. On Fri, Jan 29, 2021 at 2:33 PM Tao Li wrote: > @Chamikara Jayalath Sorry about the confusion. But > I did more testing, and using the Spark runner actually yields the same > error: > > > > java.la

Re: Potential bug with ParquetIO.read when reading arrays

2021-01-29 Thread Chamikara Jayalath
Thanks. It might be something good to document in case other users run into this as well. Can you file a JIRA with the details? On Fri, Jan 29, 2021 at 10:47 AM Tao Li wrote: > OK I think this issue is due to incompatibility between the parquet files > (created with Spark 2.4) and parquet

Re: Overwrite support from ParquetIO

2021-01-28 Thread Chamikara Jayalath
On Thu, Jan 28, 2021 at 9:14 AM Alexey Romanenko wrote: > 1. Personally, I’d recommend purging the output directory (if it’s > needed, of course) before starting your pipeline, as a part of your driver > program and not in a DoFn since, as Reuven mentioned before, you want to avoid > potential side

Re: Overwrite support from ParquetIO

2021-01-27 Thread Chamikara Jayalath
On Wed, Jan 27, 2021 at 12:06 PM Tao Li wrote: > @Alexey Romanenko thanks for your response. > Regarding your questions: > >1. Yes, I can purge this directory (e.g. using the S3 client from the AWS SDK) >before using ParquetIO to save files. The caveat is that this deletion >operation is

Re: snowflake io in python

2020-11-23 Thread Chamikara Jayalath
dataflow until then? > Is there a timeline for that work to be completed? > > Thank you so much for your help > > On Fri, Nov 20, 2020 at 9:33 PM Chamikara Jayalath > wrote: > >> This is because Python does not understand that Java-specific

Re: Documentation for Cross-Language Transforms

2020-11-20 Thread Chamikara Jayalath
The PR went in and the documentation is live now: https://beam.apache.org/documentation/programming-guide/#mulit-language-pipelines Thanks, Cham On Wed, Nov 18, 2020 at 10:05 AM Chamikara Jayalath wrote: > This was mentioned in a separate thread but thought it would be good to > highlight here i

Re: snowflake io in python

2020-11-20 Thread Chamikara Jayalath
2020 at 11:14 AM Brian Hulette wrote: > +Chamikara Jayalath any idea why this is still > doing a runner API roundtrip and failing? It's a multi-language pipeline, > and Alan has it configured to run on Dataflow runner V2. > > On Fri, Nov 20, 2020 at 10:36 AM Alan Krumholz

Re: Partition unbounded collection like Kafka source

2020-07-27 Thread Chamikara Jayalath
Probably you should apply the Partition[1] transform on the output PCollection of your read. Note though that the exact parallelization is runner dependent (for example, the runner might autoscale up, resulting in more writers). Did you run into issues when just reading from Kafka and writing to
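A sketch of applying Partition to the read output; the Create stands in for the Kafka read, and elements are assumed to be (key, value) tuples:

    import apache_beam as beam

    def by_key(message, num_partitions):
        # Route each message by a hash of its key.
        key, _ = message
        return hash(key) % num_partitions

    with beam.Pipeline() as p:
        messages = p | beam.Create([('a', 1), ('b', 2), ('c', 3)])
        shards = messages | beam.Partition(by_key, 2)
        shards[0] | 'Shard0' >> beam.Map(print)
        shards[1] | 'Shard1' >> beam.Map(print)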

Re: ReadFromKafka returns error - RuntimeError: cannot encode a null byte[]

2020-07-22 Thread Chamikara Jayalath
Thank you for the prompt responses and looking forward to using this >> feature in the future. >> >> Regards, >> Ayush. >> >> On Sat, Jul 18, 2020 at 3:14 PM Chamikara Jayalath >> wrote: >> >>> >>> >>> On Fri, Jul 17, 2020 at 10:04 PM

Re: ReadFromKafka returns error - RuntimeError: cannot encode a null byte[]

2020-07-18 Thread Chamikara Jayalath
streaming. > I'm adding an example, but I've only tested it with Dataflow so far. I hope to test that example with more runners and add additional instructions there. https://github.com/apache/beam/pull/12188 Thanks, Cham [1] https://beam.apache.org/roadmap/connectors-multi-sdk/ > >

Re: ReadFromKafka returns error - RuntimeError: cannot encode a null byte[]

2020-07-17 Thread Chamikara Jayalath
Luke Cwik wrote: > >> +Heejong Lee +Chamikara Jayalath >> >> Do you know if your trial record has an empty key or value? >> If so, then you hit a bug and it seems as though there was a miss in >> supporting this use case. >> >> Heejong and Cham, >

Re: FileIO write to new folder every hour.

2020-07-15 Thread Chamikara Jayalath
On Tue, Jul 14, 2020 at 10:38 PM Almeida, Julius wrote: > Hi Team, > > Thanks Mohil, that seems to be working. > > I would also like to control the size of the files I write to; is there a way > to achieve this in Beam? > > I am currently using this, but am not sure if it works.

Re: ReadFromKafka: UnsupportedOperationException: The ActiveBundle does not have a registered bundle checkpoint handler

2020-07-15 Thread Chamikara Jayalath
+1 for prioritizing this. On Thu, Jul 9, 2020 at 8:08 AM Piotr Filipiuk wrote: > Thank you for looking into this. > > I upvoted BEAM-6868. > Is there anything else I can do to have that feature prioritized, other > than trying to contribute

Re: Continuous Read pipeline

2020-06-23 Thread Chamikara Jayalath
On Fri, Jun 12, 2020 at 12:52 AM TAREK ALSALEH wrote: > Hi, > > I am using the Python SDK with Dataflow as my runner. I am looking at > implementing a streaming pipeline that will continuously monitor a GCS > bucket for incoming files and, depending on the regex of the file, launch a > set of

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-18 Thread Chamikara Jayalath
nt_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":370,"grpc_status":14,"referenced_errors":[{"created":"@1591997030.61384","description":"C-ares >>> status is not ARES_SUCCESS: Misformatted domain >

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Chamikara Jayalath
To clarify, the Kafka dependency was already available as an embedded dependency in the Java SDK Harness, but I'm not sure if this worked for DirectRunner. Starting with 2.22, we'll be staging dependencies from the environment during pipeline submission. On Mon, Jun 8, 2020 at 3:23 PM Chamikara Jayalath wrote

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Chamikara Jayalath
> Thank you for the suggestions. >> >> Neither Kafka nor Flink run in a docker container; they all run locally. >> Furthermore, the same issue happens for the Direct Runner. That being said, >> changing from localhost:$PORT to $IP_ADDR:$PORT resulted in a different >> err

Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-05 Thread Chamikara Jayalath
Is it possible that 'localhost:9092' is not available from the Docker environment where the Flink step is executed? Can you try specifying the actual IP address of the node running the Kafka broker? On Fri, Jun 5, 2020 at 2:53 PM Luke Cwik wrote: > +dev +Chami

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-06-01 Thread Chamikara Jayalath
Great. Thanks. On Mon, Jun 1, 2020 at 9:14 AM Alexey Romanenko wrote: > Yes, I tested it with the cross-language transform (Java pipeline with > Python external transform). > > On 1 Jun 2020, at 17:49, Chamikara Jayalath wrote: > > To clarify, is the error resolved with

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-06-01 Thread Chamikara Jayalath
To clarify, is the error resolved with the cross-language transform as well? If not, please file a Jira. On Mon, Jun 1, 2020 at 8:24 AM Kyle Weaver wrote: > > It would be useful to print out such errors with an Error-level log, I > think. > > I agree, using environment_type=PROCESS is difficult

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-28 Thread Chamikara Jayalath
This might have to do with https://github.com/apache/beam/pull/11670. +Lukasz Cwik, was there a subsequent fix that was not included in the release? On Thu, May 28, 2020 at 10:29 AM Kyle Weaver wrote: > What source are you using? > > On Thu, May 28, 2020 at 1:24 PM Alexey Romanenko > wrote: >

Re: Behavior of KafkaIO

2020-05-11 Thread Chamikara Jayalath
The number of partitions assigned to a given split depends on the desiredNumSplits value provided by the runner. https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedSource.java#L54 (This is assuming that you are using Beam Kafka

Re: Optimize Read from BQ in Streaming Mode

2020-04-23 Thread Chamikara Jayalath
The current BQ source is a bounded source: basically, you are reading a SNAPSHOT of a BQ table at a given point in time. It's possible to use the BQ source (and any other bounded source) from a streaming pipeline. This will result in an automatic bounded-to-unbounded converter being invoked that

Re: Kafka IO: value of expansion_service

2020-04-21 Thread Chamikara Jayalath
On Tue, Apr 21, 2020 at 12:43 PM Piotr Filipiuk wrote: > Hi, > > I would like to know whether it is possible to run a streaming pipeline > that reads from (or writes to) Kafka using DirectRunner? If so, what should > the expansion_service point to: >
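A sketch of pointing ReadFromKafka at a manually started expansion service for a local run (the jar name and port are assumptions):

    # Start the Java expansion service first, e.g.:
    #   java -jar beam-sdks-java-io-expansion-service-<version>.jar 8097
    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka

    with beam.Pipeline() as p:  # DirectRunner by default
        (p
         | ReadFromKafka(
             consumer_config={'bootstrap.servers': 'localhost:9092'},
             topics=['my_topic'],
             expansion_service='localhost:8097')
         | beam.Map(print))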

Re: Reading BigQuery table containing a repeated field into POJOs

2020-04-20 Thread Chamikara Jayalath
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner$DoFnProcessContext.output > (SimpleDoFnRunner.java:615) > at > org.apache.beam.sdk.io.gcp.bigquery.PassThroughThenCleanup$IdentityFn.processElement > (PassThroughThenCleanup.java:83) > > On Sat, 18 Apr 2020, at

Re: Reading BigQuery table containing a repeated field into POJOs

2020-04-17 Thread Chamikara Jayalath
Do you have the full stack trace? Also, does readTableRows() work for you (without using schemas)? On Fri, Apr 17, 2020 at 3:44 AM Joshua Bassett wrote: > Hi there > > I'm trying to read rows from a BigQuery table that contains a repeated > field into POJOs. Unfortunately, I'm running into

Re: WriteToBigquery on DataflowRunner with --experiment=beam_fn_api: B cannot be cast to org.apache.beam.sdk.values.KV

2020-01-13 Thread Chamikara Jayalath
On Mon, Jan 13, 2020 at 12:45 PM Bo Shi wrote: > Python 3.7.5 > Beam 2.17 > > I've used both WriteToBigquery and BigQueryBatchFileLoads successfully > using DirectRunner. I've boiled the issue down to a small > reproducible case (attached). The following works great: > > $ pipenv run python

Re: Assertion error when using kafka module in python

2020-01-08 Thread Chamikara Jayalath
Hmm, seems like a Java (external) ParDo is being forwarded to the Python SDK for execution somehow. +Maximilian Michels might know more. On Sun, Jan 5, 2020 at 2:57 AM Yu Watanabe wrote: > Hello. > > I would like to sink data into Kafka using the Kafka module for > Python; however, I'm > getting

Re: JDBCIO reader + BigQuery writer extremly slow due to bundle size = 1

2020-01-02 Thread Chamikara Jayalath
On Thu, Jan 2, 2020 at 10:10 AM Konstantinos P. wrote: > Hi! > > I have set up a Beam pipeline to read from a PostgreSQL server and write to a > BigQuery table, and it takes forever even for 20k records. While > investigating the process, I found that the pipeline created one file per >

Re: How to store offset with kafkaio

2019-12-04 Thread Chamikara Jayalath
I assume you meant the Kafka offset: https://kafka.apache.org/documentation/#intro_topics Currently I don't think this is possible, for two reasons. (1) Currently the Kafka source can either read from a given topic or a set of topic partitions, but not from a given offset -

Re: The state of external transforms in Beam

2019-11-08 Thread Chamikara Jayalath
Sent https://github.com/apache/beam/pull/10054 to update the roadmap. Thanks, Cham On Mon, Nov 4, 2019 at 10:24 AM Chamikara Jayalath wrote: > Makes sense. > > I can look into expanding on what we have at the following location and adding > links to some of the existing work as

Re: The state of external transforms in Beam

2019-11-04 Thread Chamikara Jayalath
https://jira.apache.org/jira/browse/BEAM-7870 > > It would be great to have the cross-language current state mentioned as a top > level entry on https://beam.apache.org/roadmap/ > > On Mon, Sep 16, 2019 at 6:07 PM Chamikara Jayalath > wrote: > >> Thanks for the nice write up Chad.

Re: No filesystem found for scheme s3 using FileIO

2019-09-24 Thread Chamikara Jayalath
As Magnus mentioned, FileSystems are picked up from the classpath and registered here: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L480 Seems like Flink is invoking this method at the following locations.

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Chamikara Jayalath
Are you specifying the number of shards to write to: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L859 If so, this will incur an additional shuffle to re-distribute data written by all workers into the given number of shards before
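The same trade-off sketched with the Python SDK's WriteToText (paths illustrative): a fixed shard count forces the redistribution shuffle, while omitting it leaves sharding to the runner:

    import apache_beam as beam
    from apache_beam.io import WriteToText

    with beam.Pipeline() as p:
        lines = p | beam.Create(['a', 'b', 'c'])
        # Fixed shard count: incurs a shuffle to redistribute into 3 shards.
        lines | 'Fixed' >> WriteToText('/tmp/out-fixed', num_shards=3)
        # Runner-determined sharding: no extra shuffle.
        lines | 'Free' >> WriteToText('/tmp/out-free')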

Re: The state of external transforms in Beam

2019-09-16 Thread Chamikara Jayalath
Thanks for the nice write up Chad. On Mon, Sep 16, 2019 at 12:17 PM Robert Bradshaw wrote: > Thanks for bringing this up again. My thoughts on the open questions below. > > On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova wrote: > > That commit solves 2 problems: > > > > Adds the pubsub Java

Re: AvroIO Windowed Writes - Number of files to specify

2019-09-12 Thread Chamikara Jayalath
peline.applyInternal(Pipeline.java:537) >>> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488) >>> at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:370) >>> >> >> >> >> Best >> Ziyad >> >> >

Re: Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

2019-09-04 Thread Chamikara Jayalath
+Pablo Estrada who added this. I don't think we have tested this specific option, but I believe the additional BQ parameters option was added in a generic way to accept all additional parameters. Looking at the code, it seems like additional parameters do get passed through to load jobs:

Re: AvroIO Windowed Writes - Number of files to specify

2019-09-04 Thread Chamikara Jayalath
Do you mean the value to specify for the number of shards to write [1]? For this, I think it's better not to specify any value, which will give the runner the most flexibility. Thanks, Cham [1]

Re: Python FileBasedSource supporting gzip

2019-08-19 Thread Chamikara Jayalath
If you want to connect to all file systems supported by Beam (currently GCS, HDFS, and local) in an abstract way, you can use the filesystems API: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py Thanks, Cham > > Cheers > Oliver > > On Fri, Aug 1

Re: Portability framework: multiple environments in one pipeline

2019-07-23 Thread Chamikara Jayalath
ould be adding documentation around current solution in the future. Thanks, Cham > > -chad > > > > On Tue, Jul 23, 2019 at 1:39 PM Chamikara Jayalath > wrote: > >> I think we have primary focussed on the ability run transforms from >> multiple SDK in the same pipe

Re: Portability framework: multiple environments in one pipeline

2019-07-23 Thread Chamikara Jayalath
I think we have primarily focused on the ability to run transforms from multiple SDKs in the same pipeline (cross-language) so far but, as Robert mentioned, the framework currently in development should also be usable for running pipelines that use multiple environments that have the same SDK installed

Re: [Python] Read Hadoop Sequence File?

2019-07-01 Thread Chamikara Jayalath
when looking at > transitioning from Dataproc to Dataflow using GCS as a storage buffer instead > of a traditional HDFS. > > From what I've been able to tell from source code and documentation, Java > is able to but not Python? > > Thanks, > Shannon > > On Mon, Jul 1, 2

Re: [Python] Read Hadoop Sequence File?

2019-07-01 Thread Chamikara Jayalath
I don't think we have a source/sink for reading Hadoop sequence files. Your best bet currently will probably be to use the FileSystems abstraction to open a file from a ParDo and read directly from there using a library that can read sequence files. Thanks, Cham On Mon, Jul 1, 2019 at 8:42 AM
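A sketch of that ParDo approach using Beam's FileSystems abstraction; parse_sequence_file is a hypothetical stand-in for whichever SequenceFile reader library you choose:

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class ReadSequenceFiles(beam.DoFn):
        def process(self, file_pattern):
            # Expand the glob, then open each file through Beam's FileSystems,
            # which works for GCS, HDFS, and local paths alike.
            for metadata in FileSystems.match([file_pattern])[0].metadata_list:
                with FileSystems.open(metadata.path) as handle:
                    for record in parse_sequence_file(handle):  # hypothetical
                        yield record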
