How do you write a portable runner pipeline in separate Python code?

2019-09-12 Thread Yu Watanabe
Hello. I would like to ask for help with my sample code for the portable runner on Apache Flink. I was able to work out wordcount.py using this page: https://beam.apache.org/roadmap/portability/ I got the two files below under /tmp. -rw-r--r-- 1 ywatanabe ywatanabe 185 Sep 12 19:56

Re: AvroIO Windowed Writes - Number of files to specify

2019-09-12 Thread Ziyad Muhammed
Hi Cham, any update on this? Best, Ziyad On Thu, Sep 5, 2019 at 5:43 PM Ziyad Muhammed wrote: > Hi Cham > > I tried that before. Apparently it's not accepted by either the direct runner > or the dataflow runner. I get the error below: > > Exception in thread "main" java.lang.IllegalArgumentException:

Re: How do you write a portable runner pipeline in separate Python code?

2019-09-12 Thread Lukasz Cwik
When you use a local filesystem path and a Docker environment, "/tmp" is written inside the container. You can solve this issue by:
* Using a "remote" filesystem such as HDFS/S3/GCS/...
* Mounting an external directory into the container so that any "local" writes appear outside the container
*
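The first option needs only a flag change. A minimal sketch, assuming a job server already listening on localhost:8099 and a GCS bucket you control (both are placeholders for your setup):

```shell
# Point --output at a remote filesystem so results do not land inside
# the SDK worker container's /tmp (bucket name is a placeholder).
python -m apache_beam.examples.wordcount \
  --input gs://apache-beam-samples/shakespeare/kinglear.txt \
  --output gs://my-bucket/counts \
  --runner PortableRunner \
  --job_endpoint localhost:8099
```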

How can I work with multiple PCollections?

2019-09-12 Thread Steve973
I am new to Beam, and I am pretty excited to get started. I have been doing quite a bit of research and playing around with the API. But my use case, unless I am approaching it incorrectly, suggests that I will need to process multiple PCollections in some parts of my pipeline. I am

Re: How do you write a portable runner pipeline in separate Python code?

2019-09-12 Thread Kyle Weaver
+dev I think we should probably point new users of the portable Flink/Spark runners to use loopback or some other non-docker environment, as Docker adds some operational complexity that isn't really needed to run a word count example. For example, Yu's pipeline errored here because the expected

Re: How can I work with multiple PCollections?

2019-09-12 Thread Lukasz Cwik
Yes, you can create multiple output PCollections by using a ParDo with multiple outputs instead of inserting them into Mongo. It could be useful to read through the programming guide sections on PCollections[1] and PTransforms with multiple outputs[2]; feel free to return with more questions. 1:

Re: Python errors when using batch+windows+textio

2019-09-12 Thread Kyle Weaver
Hi Pawel, could you tell us which version of the Beam Python SDK you are using? For the record, this looks like a known issue: https://issues.apache.org/jira/browse/BEAM-6860 Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Wed, Sep 11, 2019 at 6:33 AM Paweł Kordek

Re: How do you write a portable runner pipeline in separate Python code?

2019-09-12 Thread Thomas Weise
This should become much better with 2.16 when we have the Docker images prebuilt. Docker is probably still the best option for Python on a JVM based runner in a local environment that does not have a development setup. On Thu, Sep 12, 2019 at 1:09 PM Kyle Weaver wrote: > +dev I think we

Re: AvroIO Windowed Writes - Number of files to specify

2019-09-12 Thread Chamikara Jayalath
I'm a bit confused, since we mention https://issues.apache.org/jira/browse/BEAM-1438 before that error, but that JIRA was fixed a few years ago. https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L312 +Reuven Lax can you comment on

Re: Beam/flink/kubernetes/minikube/wordcount example

2019-09-12 Thread Austin Bennett
I got hung up on that issue earlier this week. I was using Flink 1.7 and Beam 2.15, and wasn't using Kubernetes. Then I gave up, so I don't have a solution :-/ I don't understand the job server well enough, but I think I was getting the error when I did not have it running (I still don't understand portability
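For reference, one way to get a job server running (so the Python SDK has something to submit to) is the published Docker image. The image name and tag below are assumptions and may differ for your Beam and Flink versions; check Docker Hub for the one matching your release:

```shell
# Start a Flink job server listening on localhost:8099
# (image name/tag are assumed here, not confirmed by the thread).
docker run --net=host apachebeam/flink1.7_job_server:2.15.0
```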

Re: How do you write a portable runner pipeline in separate Python code?

2019-09-12 Thread Kyle Weaver
I prefer loopback because a) it writes output files to the local filesystem, as the user expects, and b) you don't have to pull or build docker images, or even have docker installed on your system -- which is one less point of failure. Kyle Weaver | Software Engineer | github.com/ibzib |
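A loopback run might look like the following; the endpoint and paths are placeholders, and a job server must already be running:

```shell
# With LOOPBACK, SDK workers run in the submitting process, so
# --output lands on the local filesystem and no Docker image is needed.
python -m apache_beam.examples.wordcount \
  --input /tmp/kinglear.txt \
  --output /tmp/counts \
  --runner PortableRunner \
  --job_endpoint localhost:8099 \
  --environment_type LOOPBACK
```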

Re: How to buffer events using the Spark portable runner?

2019-09-12 Thread Yu Watanabe
Lukasz, thank you for the reply. I will try Apache Flink. Thanks, Yu On Sun, Sep 8, 2019 at 11:59 PM Lukasz Cwik wrote: > Try using Apache Flink. > > On Sun, Sep 8, 2019 at 6:23 AM Yu Watanabe wrote: > >> Hello. >> >> I would like to ask a question related to timely processing as stated in

How to debug Dataflow locally

2019-09-12 Thread deepak kumar
Hi All, I am trying to come up with a framework that would help debug a Dataflow job end to end locally. Dataflow can work with hundreds of sources and sinks. Is there already a framework to set up these sources and sinks locally, e.g. if my Dataflow job reads from BQ and inserts to Bigtable? Please let me