Re: Using Beam Built-in I/O Transforms with an external framework.

2019-09-18 Thread Pulasthi Supun Wickramasinghe
Hi Chamikara, Chad Thanks for your replies. @Chamikara I am already working on a Beam runner for Twister2, the runner is functional for the most part even though I have not fully tested it. What I was thinking was to find a way to transform just the sources using the Read primitive that is used

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-18 Thread Ahmet Altay
I believe the flag was never relevant for PortableRunner. I might be wrong as well. The flag affects a few bits in the core code and that is why the solution cannot be by just setting the flag in Dataflow runner. It requires some amount of clean up. I agree that it would be good to clean this up,

Re: using avro instead of json for BigQueryIO.Write

2019-09-18 Thread Pablo Estrada
Thanks for offering to work on this! It would be awesome to have it. I can say that we don't have that for Python ATM. On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz wrote: > Our experience has actually been that avro is more efficient than even > parquet, but that might also be skewed from our

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Reuven Lax
I believe that the Total shuffle data process counter counts the number of bytes written to shuffle + the number of bytes read. So if you shuffle 1GB of data, you should expect to see 2GB on the counter. On Wed, Sep 18, 2019 at 2:39 PM Shannon Duncan wrote: > Ok just ran the job on a small

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
Sorry missed a part of the map output for flatten: [image: image.png] However the shuffle does show only 29.32 GB going into it but the output of Total Shuffled data is 58.66 GB [image: image.png] On Wed, Sep 18, 2019 at 4:39 PM Shannon Duncan wrote: > Ok just ran the job on a small input

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
Ok just ran the job on a small input and did not specify numShards. so it's literally just: .apply("WriteLines", TextIO.write().to(options.getOutput())); Output of map for join: [image: image.png] Details of Shuffle: [image: image.png] Reported Bytes Shuffled: [image: image.png] On Wed, Sep

Re: Using Beam Built-in I/O Transforms with an external framework.

2019-09-18 Thread Chad Dombrova
Hi Pulasthi, Just to mirror what Cham said, it would be a non-starter to try to use a Beam IO source in another framework: to make them work, you'd have to build something that executes them with their expected protocol, and that would look an awful lot like a Beam runner. It makes more sense to

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Reuven Lax
On Wed, Sep 18, 2019 at 2:12 PM Shannon Duncan wrote: > I will attempt to do without sharding (though I believe we did do a run > without shards and it incurred the extra shuffle costs). > It shouldn't. There will be a shuffle, but that shuffle should contain a small amount of data (essentially

Re: Using Beam Built-in I/O Transforms with an external framework.

2019-09-18 Thread Chamikara Jayalath
Hi Pulasthi, This might be possible but I don't know if anybody has done this. API of Beam sources are no different from other Beam PTransforms and we highly recommend hiding away various implementations of source framework related abstractions in a composite transform [1]. So what you are

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
I will attempt to do without sharding (though I believe we did do a run without shards and it incurred the extra shuffle costs). Pipeline is simple. The only shuffle that is explicitly defined is the shuffle after merging files together into a single PCollection (Flatten Transform). So it's a

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Reuven Lax
In that case you should be able to leave sharding unspecified, and you won't incur the extra shuffle. Specifying explicit sharding is generally necessary only for streaming. On Wed, Sep 18, 2019 at 2:06 PM Shannon Duncan wrote: > batch on dataflowRunner. > > On Wed, Sep 18, 2019 at 4:05 PM

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
batch on dataflowRunner. On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax wrote: > Are you using streaming or batch? Also which runner are you using? > > On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan > wrote: > >> So I followed up on why TextIO shuffles and dug into the code some. It is >> using

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Chamikara Jayalath
Are you specifying the number of shards to write to: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L859 If so, this will incur an additional shuffle to re-distribute data written by all workers into the given number of shards before

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Reuven Lax
Are you using streaming or batch? Also which runner are you using? On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan wrote: > So I followed up on why TextIO shuffles and dug into the code some. It is > using the shards and getting all the values into a keyed group to write to > a single file. > >

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Jeff Klukas
What you propose with a writer per bundle is definitely possible, but I expect the blocker is that in most cases the runner has control of bundle sizes and there's nothing exposed to the user to control that. I've wanted to do similar, but found average bundle sizes in my case on Dataflow to be so

Re: Prevent Shuffling on Writing Files

2019-09-18 Thread Shannon Duncan
So I followed up on why TextIO shuffles and dug into the code some. It is using the shards and getting all the values into a keyed group to write to a single file. However... I wonder if there is way to just take the records that are on a worker and write them out. Thus not needing a shard number

Using Beam Built-in I/O Transforms with an external framework.

2019-09-18 Thread Pulasthi Supun Wickramasinghe
Hi Dev's We have a big data processing framework named Twister2, and wanted to know if there is any way we could leverage the I/O Transforms that are built into Apache Beam externally. That is rather than using it in a Beam pipeline just use them as data sources in our project. Just wanted to

Re: Beam Summit Videos in youtube

2019-09-18 Thread Maximilian Michels
Hi Rahul, The Beam Summit committee is working on this at the moment. Stay tuned. Thanks, Max On 18.09.19 11:39, rahul patwari wrote: Hi, The videos of Beam Summit that has happened recently have disappeared from YouTube Apache Beam channel. Is uploading the videos a WIP? Thanks, Rahul

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-18 Thread Maximilian Michels
I disagree that this flag is obsolete. It is still serving a purpose for batch users using dataflow runner and that is decent chunk of beam python users. It is obsolete for the PortableRunner. If the Dataflow Runner needs this flag, couldn't we simply add it there? As far as I know Dataflow

Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread rahul patwari
Hi, I would love to join the call. Can you also share the meeting invitation with me? Thanks, Rahul On Wed 18 Sep, 2019, 11:48 PM Xinyu Liu, wrote: > Alexey and Etienne: I'm very happy to join the sync-up meeting. Please > forward the meeting info to me. I am based in California, US and

Beam Summit Videos in youtube

2019-09-18 Thread rahul patwari
Hi, The videos of Beam Summit that has happened recently have disappeared from YouTube Apache Beam channel. Is uploading the videos a WIP? Thanks, Rahul

Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread Xinyu Liu
Alexey and Etienne: I'm very happy to join the sync-up meeting. Please forward the meeting info to me. I am based in California, US and hopefully the time will work :). Thanks, Xinyu On Wed, Sep 18, 2019 at 6:39 AM Etienne Chauchot wrote: > Hi Xinyu, > > Thanks for offering help ! My comments

Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread Etienne Chauchot
Hi Xinyu, Thanks for offering help ! My comments are inline: Le vendredi 13 septembre 2019 à 12:16 -0700, Xinyu Liu a écrit : > Hi, Etienne, > The slides are very informative! Thanks for sharing the details about how the > Beam API are mapped into Spark > Structural Streaming. Thanks ! > We

Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread Etienne Chauchot
Hi Rui,Thanks for proposing to contribute to this new runner ! Here are the pointers:- SS runner branch: https://github.com/apache/beam/tree/spark-runner_structured-streaming- spark design doc for multiple watermarks support: