Hi Kenn,

We are using Beam 2.6, and we use spark-submit to submit jobs to a Spark 2.2 cluster on Kubernetes.

On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles <[email protected]> wrote:

> Thanks for the report! I filed
> https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
>
> Can you share what version of Beam you are using?
>
> Kenn
>
> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm <[email protected]> wrote:
>
>> We are trying to set up a pipeline using BeamSql, with the default trigger
>> (AfterWatermark crosses the window). Below is the pipeline:
>>
>> KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) ---> BeamSql ---> KafkaSink (KafkaIO)
>>
>> We are using the Spark runner for this. The BeamSql query is:
>>
>> select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3
>>
>> We are grouping by Col3, which is a string; it can hold the values string[0-9].
>>
>> The records are emitted to the Kafka sink at the 1-minute boundary, but the
>> output records in Kafka are not as expected. Below is the output observed
>> (WST and WET are the window start time and window end time):
>>
>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 0000 +0000","WET":"2018-10-09 09-56-00 0000 +0000"}
>>
>> Note the string6 records: after the first record with count_col1 = 1, the same
>> window emits repeated records with count_col1 = 0.
>>
>> We ran the same pipeline with the direct runner and the Flink runner and we
>> don't see 0 entries for count_col1.
>>
>> As per the Beam capability matrix
>> (https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what),
>> GroupBy is not fully supported. Is this one of those cases?
>>
>> Thanks & Regards,
>> Vishwas
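
For reference, the job is shaped roughly like the sketch below. This is not our exact code: the broker address, topic names, and the single-column row schema are placeholders, the JSON parse/format steps are trimmed down, and it is written against a recent Beam Java SDK (SqlTransform; older releases expose the same query through BeamSql.query).

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.Duration;

public class GroupByCol3Pipeline {

  // Single-column schema kept minimal for the sketch; the real payload has more columns.
  private static final Schema ROW_SCHEMA = Schema.builder().addStringField("Col3").build();

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    PCollection<Row> rows =
        p.apply("ReadFromKafka", KafkaIO.<String, String>read()
                .withBootstrapServers("kafka:9092")        // placeholder broker address
                .withTopic("input-topic")                  // placeholder topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            .apply(Values.<String>create())
            // For brevity the record value is used directly as Col3; the real job parses JSON.
            .apply("ToRow", MapElements.via(new SimpleFunction<String, Row>() {
              @Override
              public Row apply(String value) {
                return Row.withSchema(ROW_SCHEMA).addValue(value).build();
              }
            }))
            .setRowSchema(ROW_SCHEMA)
            // Fixed 1-minute windows; the trigger stays at the default (AfterWatermark).
            .apply("Window1Min", Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1))));

    // The grouped count per window, as in the query above.
    PCollection<Row> counts =
        rows.apply("CountByCol3",
            SqlTransform.query("SELECT Col3, COUNT(*) AS count_col1 FROM PCOLLECTION GROUP BY Col3"));

    counts
        .apply("FormatOutput", MapElements.into(TypeDescriptors.strings())
            .via(row -> row.getString("Col3") + "," + row.getInt64("count_col1")))
        .apply("WriteToKafka", KafkaIO.<Void, String>write()
            .withBootstrapServers("kafka:9092")            // placeholder broker address
            .withTopic("output-topic")                     // placeholder topic
            .withValueSerializer(StringSerializer.class)
            .values());

    p.run();
  }
}

When the bundled jar is run through spark-submit, the Beam pipeline options are passed as program arguments (e.g. --runner=SparkRunner).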
