Robert,

Do you have any numbers by chance regarding this optimisation?

Alexey

> On 5 Oct 2021, at 00:27, Robert Bradshaw <rober...@google.com> wrote:
> 
> https://github.com/apache/beam/pull/15637 might help some.
> 
> On Thu, Sep 9, 2021 at 5:21 PM Tao Li <t...@zillow.com> wrote:
> Thanks Mike for this info!
> 
>  
> 
> From: Mike Kaplinskiy <m...@ladderlife.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Tuesday, September 7, 2021 at 2:15 PM
> To: "user@beam.apache.org" <user@beam.apache.org>
> Cc: Alexey Romanenko <aromanenko....@gmail.com>, Andrew Pilloud 
> <apill...@google.com>, Ismaël Mejía <ieme...@gmail.com>, Kyle Weaver 
> <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> A long time ago, when I was experimenting with the Spark runner for a batch 
> job, I noticed that a lot of time was spent in GC as well. In my case I 
> narrowed it down to how the Spark runner implements Coders.
> 
>  
> 
> Spark's value prop is that it only serializes data when it truly has no other 
> choice - i.e. when it needs to reclaim memory or when it sends things over 
> the wire. Unfortunately, due to the mismatch in serialization APIs between 
> Beam and Spark, Beam's Spark runner actually just serializes things all the 
> time. My theory was that the to/from byte array dance was slow. I attempted 
> to fix this at https://github.com/apache/beam/pull/8371, but I could never 
> actually reproduce a speedup in performance benchmarks.
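> 
> To make that byte-array dance concrete, this is roughly the round trip paid 
> per element when Beam coders are bridged to Spark's serialization (a minimal 
> sketch; the String coder and the helper method are just illustrative):
> 
>     import org.apache.beam.sdk.coders.CoderException;
>     import org.apache.beam.sdk.coders.StringUtf8Coder;
>     import org.apache.beam.sdk.util.CoderUtils;
> 
>     // Serialize an element to a byte[] and immediately decode it back,
>     // which is effectively what happens at each stage boundary.
>     static String roundTrip(String value) throws CoderException {
>         byte[] bytes = CoderUtils.encodeToByteArray(StringUtf8Coder.of(), value);
>         return CoderUtils.decodeFromByteArray(StringUtf8Coder.of(), bytes);
>     }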
> 
>  
> 
> If you're feeling up to it, you could try reviving something like that PR and 
> see if it helps.
> 
>  
> 
> Mike.
> 
> Ladder <http://bit.ly/1VRtWfS>. The smart, modern way to insure your life.
> 
>  
> 
>  
> 
> On Sat, Aug 14, 2021 at 4:35 PM Tao Li <t...@zillow.com> wrote:
> 
> @Alexey Romanenko I tried out splittable ParquetIO and the processing time 
> improved from 10 min to 6 min, but that is still much longer than the 2 min 
> a native Spark app takes.
> 
>  
> 
> We are still seeing a lot of GC cost from the call stack below. Do you think 
> this ticket can fix the issue: 
> https://issues.apache.org/jira/browse/BEAM-12646 ? Thanks.
> 
>  
> 
> <image001.png>
> 
>  
> 
>  
> 
>  
> 
> From: Tao Li <t...@zillow.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Friday, August 6, 2021 at 11:12 AM
> To: Alexey Romanenko <aromanenko....@gmail.com>
> Cc: "user@beam.apache.org" <user@beam.apache.org>, Andrew Pilloud 
> <apill...@google.com>, Ismaël Mejía <ieme...@gmail.com>, Kyle Weaver 
> <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> Thanks @Alexey Romanenko, please see my clarifications below.
> 
>  
> 
>  
> 
> | “Well, of course, if you read all fields (columns) then you don’t need 
> column projection. Otherwise, it can give quite a significant performance 
> boost, especially for large tables with many columns.”
> 
>  
> 
> [Tao] Basically, my perf testing compared the Beam Spark runner with native 
> Spark. In both the Beam app and the native Spark app, I was simply reading a 
> Parquet-backed dataset and immediately saving it back to Parquet, and the 
> Beam app took 3-5 times longer than native Spark. As I shared earlier in 
> this thread, the call stack below, which comes from the Spark runner, was 
> quite time consuming.
> 
>  
> 
> <image001.png>
> 
>  
> 
>  
> 
>  
> 
> | "Legacy Read transform (non-SDF based Read) is used by default for 
> non-FnAPI opensource runners. Use `use_sdf_read` experimental flag to 
> re-enable SDF based Read transforms 
> ([BEAM-10670](https://issues.apache.org/jira/browse/BEAM-10670 
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FBEAM-10670&data=04%7C01%7Ctaol%40zillow.com%7Cfc203509e0994ed0fc8b08d9724473bc%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637666461187468581%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=flZziWvOW842LjMedT6Zgj8Xj23tAE43pUxqGe43aIE%3D&reserved=0>))”
> 
>  
> 
> [Tao] We are not specifying the `use_sdf_read` experimental flag in our Beam 
> app, so we are not using the SDF translation.
> 
>  
> 
>  
> 
>  
> 
> From: Alexey Romanenko <aromanenko....@gmail.com>
> Date: Friday, August 6, 2021 at 8:13 AM
> To: Tao Li <t...@zillow.com>
> Cc: "user@beam.apache.org" <user@beam.apache.org>, Andrew Pilloud 
> <apill...@google.com>, Ismaël Mejía <ieme...@gmail.com>, Kyle Weaver 
> <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
>  
> 
>  
> 
> On 5 Aug 2021, at 18:17, Tao Li <t...@zillow.com> wrote:
> 
>  
> 
> It was a great presentation!
> 
>  
> 
> Thanks!
> 
>  
> 
> Regarding my perf testing, I was not doing aggregation, filtering, 
> projection or joining. I was simply reading all the fields of the Parquet 
> files and then immediately saving the PCollection back to Parquet.
> 
>  
> 
> Well, of course, if you read all fields (columns) then you don’t need column 
> projection. Otherwise, it can give quite a significant performance boost, 
> especially for large tables with many columns.
> 
>  
> 
>  
> 
> Regarding SDF translation, is it enabled by default?
> 
>  
> 
> From Beam 2.30.0 release notes:
> 
>  
> 
> "Legacy Read transform (non-SDF based Read) is used by default for non-FnAPI 
> opensource runners. Use `use_sdf_read` experimental flag to re-enable SDF 
> based Read transforms 
> ([BEAM-10670](https://issues.apache.org/jira/browse/BEAM-10670 
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FBEAM-10670&data=04%7C01%7Ctaol%40zillow.com%7Cfc203509e0994ed0fc8b08d9724473bc%7C033464830d1840e7a5883784ac50e16f%7C0%7C0%7C637666461187468581%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=flZziWvOW842LjMedT6Zgj8Xj23tAE43pUxqGe43aIE%3D&reserved=0>))”
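> 
> For reference, re-enabling the SDF-based read programmatically would look 
> roughly like this; a minimal sketch, assuming the usual pipeline-options 
> setup with `args`, and equivalent to passing --experiments=use_sdf_read on 
> the command line:
> 
>     import org.apache.beam.sdk.options.ExperimentalOptions;
>     import org.apache.beam.sdk.options.PipelineOptions;
>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
> 
>     PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
>     // Opt back in to the SDF-based Read translation.
>     ExperimentalOptions.addExperiment(options.as(ExperimentalOptions.class), "use_sdf_read");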
> 
>  
> 
> —
> 
> Alexey
> 
>  
> 
>  I will check out ParquetIO splittable. Thanks!
> 
>  
> 
> From: Alexey Romanenko <aromanenko....@gmail.com>
> Date: Thursday, August 5, 2021 at 6:40 AM
> To: Tao Li <t...@zillow.com>
> Cc: "user@beam.apache.org" <user@beam.apache.org>, Andrew Pilloud 
> <apill...@google.com>, Ismaël Mejía <ieme...@gmail.com>, Kyle Weaver 
> <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> It’s very likely that Spark SQL has much better performance because of SQL 
> push-downs and because it avoids additional ser/deser operations.
> 
>  
> 
> At the same time, did you try to leverage “withProjection()” in ParquetIO 
> and project only the fields that you need?
> 
>  
> 
> Did you use the splittable ParquetIO read (it's not enabled by default; 
> fixed in [1])? A sketch follows.
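> 
> Enabling it is a single builder call; a minimal sketch, assuming 
> `avroSchema` and `inputPath` are already defined:
> 
>     PCollection<GenericRecord> records = pipeline.apply(
>         ParquetIO.read(avroSchema)
>             .from(inputPath)
>             // Splittable reading is off by default in these versions.
>             .withSplit());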
> 
>  
> 
> Also, using the SDF translation for Read on the Spark runner can cause 
> performance degradation as well (we noticed that in our experiments). Try 
> the non-SDF read if you are not using it already [2].
> 
>  
> 
>  
> 
> PS: Yesterday, at Beam Summit, we (Ismael and I) gave a related talk. I’m 
> not sure if a recording is available yet, but you can find the slides here 
> [3]; they may be helpful.
> 
>  
> 
>  
> 
> —
> 
> Alexey
> 
>  
> 
> [1] https://issues.apache.org/jira/browse/BEAM-12070
> [2] https://issues.apache.org/jira/browse/BEAM-10670
> [3] https://drive.google.com/file/d/17rJC0BkxpFFL1abVL01c-D0oHvRRmQ-O/view?usp=sharing
>  
> 
>  
> 
> On 5 Aug 2021, at 03:07, Tao Li <t...@zillow.com> wrote:
> 
>  
> 
> @Alexey Romanenko @Ismaël Mejía I assume you are experts on the Spark 
> runner. Can you please take a look at this thread and confirm whether this 
> Jira covers the cause: https://issues.apache.org/jira/browse/BEAM-12646 ?
> 
>  
> 
> This perf issue is currently a blocker for me.
> 
>  
> 
> Thanks so much!
> 
>  
> 
> From: Tao Li <t...@zillow.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Friday, July 30, 2021 at 3:53 PM
> To: Andrew Pilloud <apill...@google.com>, "user@beam.apache.org" 
> <user@beam.apache.org>
> Cc: Kyle Weaver <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> Thanks everyone for your help.
> 
>  
> 
> We actually did another round of perf comparison between Beam (on spark) and 
> native spark, without any projection/filtering in the query (to rule out the 
> “predicate pushdown” factor).
> 
>  
> 
> The Beam app with the Spark runner is still taking 3-5x as long as native 
> Spark, and according to the Spark metrics the cause is 
> https://issues.apache.org/jira/browse/BEAM-12646. The Spark runner is pretty 
> much the bottleneck.
> 
>  
> 
> <image001.png>
> 
>  
> 
> From: Andrew Pilloud <apill...@google.com>
> Date: Thursday, July 29, 2021 at 2:11 PM
> To: "user@beam.apache.org" <user@beam.apache.org>
> Cc: Tao Li <t...@zillow.com>, Kyle Weaver <kcwea...@google.com>, Yuchu Cao 
> <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> Actually, ParquetIO got pushdown in Beam SQL starting with v2.29.0.
> 
>  
> 
> Andrew
> 
>  
> 
> On Mon, Jul 26, 2021 at 10:05 AM Andrew Pilloud <apill...@google.com> wrote:
> 
> Beam SQL doesn't currently have project pushdown for ParquetIO (we are 
> working to expand this to more IOs). Using ParquetIO withProjection directly 
> will produce better results.
> 
>  
> 
> On Mon, Jul 26, 2021 at 9:46 AM Robert Bradshaw <rober...@google.com> wrote:
> 
> Could you try using Beam SQL [1] and see if that gives a result more similar 
> to your Spark SQL query? I would also be curious whether the performance is 
> sufficient when using withProjection [2] to read only the auction, price, 
> and bidder columns; see the sketch after the links below.
> 
>  
> 
> [1] https://beam.apache.org/documentation/dsls/sql/overview/
> [2] https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.Read.html#withProjection-org.apache.avro.Schema-org.apache.avro.Schema-
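> 
> A minimal sketch of the projection route (the record name and the 
> `readerSchema`/`inputPath` variables are hypothetical; passing the 
> projection schema twice matches the two-argument signature in [2]):
> 
>     import org.apache.avro.Schema;
>     import org.apache.avro.SchemaBuilder;
> 
>     // Ask the Parquet reader for only the three columns the query needs.
>     Schema projection = SchemaBuilder.record("bid").fields()
>         .requiredLong("auction")
>         .requiredLong("price")
>         .requiredLong("bidder")
>         .endRecord();
> 
>     PCollection<GenericRecord> rows = pipeline.apply(
>         ParquetIO.read(readerSchema)
>             .from(inputPath)
>             .withProjection(projection, projection));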
>  
> 
> On Sat, Jul 24, 2021 at 10:23 AM Tao Li <t...@zillow.com> wrote:
> 
> Thanks Robert for filing BEAM-12646 
> (https://issues.apache.org/jira/browse/BEAM-12646). This perf issue is a 
> blocker for us to adopt Beam. It would be great if the community could 
> confirm the root cause and share an ETA for the fix. Thanks so much!
> 
>  
> 
>  
> 
> From: Robert Bradshaw <rober...@google.com>
> Date: Wednesday, July 21, 2021 at 3:51 PM
> To: Tao Li <t...@zillow.com>
> Cc: "user@beam.apache.org" <user@beam.apache.org>, Kyle Weaver 
> <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> On Wed, Jul 21, 2021 at 3:00 PM Tao Li <t...@zillow.com> wrote:
> 
> @Robert Bradshaw with the Spark API, the code is actually much simpler. We 
> are just calling the Spark SQL API against a Hive table: 
> spark.sql(“SELECT auction, 0.82*(price) as euro, bidder FROM bid”)
> 
>  
> 
> Good chance that this is pushing projection of those few fields up into the 
> read operator, which could be a dramatic savings. You could try doing it 
> manually in Beam, or use Beam's SQL, which should do the same.
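> 
> A minimal sketch of the Beam SQL route (assuming `table` is the 
> PCollection<Row> read from Parquet; PCOLLECTION is SqlTransform's default 
> name for a single input):
> 
>     import org.apache.beam.sdk.extensions.sql.SqlTransform;
> 
>     PCollection<Row> result = table.apply(
>         SqlTransform.query("SELECT auction, 0.82 * price AS euro, bidder FROM PCOLLECTION"));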
> 
>  
> 
>  
> 
> I think the “globally windowed GBK” optimization you are proposing is a good 
> callout.
> 
>  
> 
> Filed https://issues.apache.org/jira/browse/BEAM-12646 to track.
> 
>  
> 
>  
> 
> From: Robert Bradshaw <rober...@google.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Wednesday, July 21, 2021 at 1:09 PM
> To: user <user@beam.apache.org>
> Cc: Kyle Weaver <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> On Wed, Jul 21, 2021 at 12:51 PM Tao Li <t...@zillow.com> wrote:
> 
> Kyle, I don’t expect such a huge perf diff either. To your question: no, I 
> am not specifying withProjection or withSplit for the Parquet reader.
> 
>  
> 
> Are you doing so in your Spark code? 
> 
>  
> 
> Below is my parquet read code:
> 
>  
> 
> PCollection<FileIO.ReadableFile> files = pipeline
>         .apply(FileIO.match().filepattern(beamRequiredPath))
>         .apply(FileIO.readMatches());
> 
> PCollection<Row> table = files
>         .apply(ParquetIO
>                 .readFiles(avroSchema)
>                 .withConfiguration(ImmutableMap.of(AvroSchemaConverter.ADD_LIST_ELEMENT_RECORDS, "false")))
>         .apply(MapElements
>                 .into(TypeDescriptors.rows())
>                 .via(AvroUtils.getGenericRecordToRowFunction(AvroUtils.toBeamSchema(avroSchema))))
>         .setCoder(RowCoder.of(AvroUtils.toBeamSchema(avroSchema)));
> 
>  
> 
>  
> 
> According to my investigation, it looks like the call stack below is very 
> computation intensive and causes a lot of GC time. The stack appears to come 
> from Spark runner code.
> 
>  
> 
> This does look inordinately expensive. I wonder if it would make sense to 
> optimize the globally windowed GBK as some other runners do. 
> 
>  
> 
>  
> 
> <image001.png>
> 
>  
> 
> From: Kyle Weaver <kcwea...@google.com>
> Date: Tuesday, July 20, 2021 at 3:57 PM
> To: Tao Li <t...@zillow.com>
> Cc: "user@beam.apache.org" <user@beam.apache.org>, Yuchu Cao 
> <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> Beam has its own implementation of Parquet IO, and doesn't use Spark's. It's 
> possible Spark's implementation does more optimizations, though perhaps not 
> enough to result in such a dramatic difference.
> 
>  
> 
> I'm curious how your Parquet read is configured. In particular, if 
> withProjection or withSplit are set.
> 
>  
> 
> On Tue, Jul 20, 2021 at 3:21 PM Tao Li <t...@zillow.com> wrote:
> 
> Hi Kyle,
> 
>  
> 
> The ParDo (using the code I shared) is the only transformation in my 
> pipeline. The input and output are Parquet files in S3 (we are using Beam 
> ParquetIO).
> 
>  
> 
> From: Kyle Weaver <kcwea...@google.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Tuesday, July 20, 2021 at 2:13 PM
> To: "user@beam.apache.org" <user@beam.apache.org>
> Cc: Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> The DoFn you shared is simple enough that it seems unlikely to be the 
> performance bottleneck here.
> 
>  
> 
> Can you share more information about your complete pipeline? What other 
> transforms are there? What sources/sinks are you using?
> 
>  
> 
> On Tue, Jul 20, 2021 at 2:02 PM Tao Li <t...@zillow.com> wrote:
> 
> Hi Beam community,
> 
>  
> 
> We are seeing a serious perf issue with Beam using the Spark runner, 
> compared with writing a native Spark app. Can you please provide some help?
> 
>  
> 
> The Beam-on-Spark app is taking 8-10 min, whereas the native Spark app is 
> only taking 2 min. Below is the Spark UI, from which you can see the 
> flatMapToPair method is very time consuming. Is this method call coming from 
> the Spark runner?
> 
>  
> 
> <image001.png>
> 
>  
> 
> I suspect this is caused by high GC time. See the “GC Time” column below:
> 
>  
> 
> <image002.png>
> 
>  
> 
>  
> 
> The Beam code is really simple, just per-row processing.
> 
>  
> 
> public class CalcFn extends DoFn<Row, Row> {
>     protected Logger log = LoggerFactory.getLogger(this.getClass());
>     private Schema schema;
> 
>     public CalcFn(Schema schema) {
>         this.schema = schema;
>     }
> 
>     @ProcessElement
>     public void processElement(@Element Row row, OutputReceiver<Row> receiver) {
>         Long auction_value = (Long) row.getBaseValue("auction");
>         Long bid_value = (Long) row.getBaseValue("bidder");
>         Long price = (Long) row.getBaseValue("price");
>         Double euro = price * 0.82;
> 
>         receiver.output(Row.withSchema(schema)
>                 .addValues(auction_value, euro, bid_value).build());
>     }
> }
> 
>  
> 
