Re: Issue with GroupByKey in BeamSql using SparkRunner

Vishwas Bm Sun, 21 Oct 2018 09:35:47 -0700

Hi,

I tried with 2.3.2 version of spark in local mode and I see the same issue.


Regards,
Vishwas


On Wed, Oct 10, 2018, 2:47 PM Ismaël Mejía <[email protected]> wrote:

> Are you trying this in a particular spark distribution or just locally ?
> I ask this because there was a data corruption issue with Spark 2.3.1
> (previous version used by Beam)
> https://issues.apache.org/jira/browse/SPARK-23243
>
> Current Beam master (and next release) moves Spark to version 2.3.2
> and that should fix some of the data correctness issues (maybe yours
> too).
> Can you give it a try and report back if this fixes your issue.
>
>
> On Tue, Oct 9, 2018 at 6:45 PM Vishwas Bm <[email protected]> wrote:
> >
> > Hi Kenn,
> >
> > We are using Beam 2.6 and using Spark_submit to submit jobs to Spark 2.2
> cluster on Kubernetes.
> >
> >
> > On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles <[email protected]> wrote:
> >>
> >> Thanks for the report! I filed
> https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
> >>
> >> Can you share what version of Beam you are using?
> >>
> >> Kenn
> >>
> >> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm <[email protected]> wrote:
> >>>
> >>> We are trying to setup a pipeline with using BeamSql and the trigger
> used is default (AfterWatermark crosses the window).
> >>> Below is the pipeline:
> >>>
> >>>    KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) --->
> BeamSql ---> KafkaSink (KafkaIO)
> >>>
> >>> We are using Spark Runner for this.
> >>> The BeamSql query is:
> >>>              select Col3, count(*) as count_col1 from PCOLLECTION
> GROUP BY Col3
> >>>
> >>> We are grouping by Col3 which is a string. It can hold values
> string[0-9].
> >>>
> >>> The records are getting emitted out at 1 min to kafka sink, but the
> output record in kafka is not as expected.
> >>> Below is the output observed: (WST and WET are indicators for window
> start time and window end time)
> >>>
> >>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000
> +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
> >>>
> >>> We ran the same pipeline using direct and flink runner and we dont see
> 0 entries for count_col1.
> >>>
> >>> As per beam matrix page (
> https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what),
> GroupBy is not fully supported,is this one of those cases ?
> >>> Thanks & Regards,
> >>> Vishwas
> >>>
>

Re: Issue with GroupByKey in BeamSql using SparkRunner

Reply via email to