Yes, you can use Windowing in batch mode by applying your own timestamps. For example,
from apache_beam.transforms.window import TimestampedValue grouped_by_window = (p | Read(...) | beam.Map(lambda x: TimestampedValue(x, timestamp=extract_timestamp(x)) | beam.WindowInto(FixedWindows(5)) | beam.Map(lambda x: (key(x), value(x))) # If the data wasn't already keyed... | beam.GroupByKey()) Now grouped_by_windows will contain a (k, vs) tuple for each key and five-second window. When running in batch mode, everything up to the group-by-key will be executed in its entirety before anything after, but the data will still be appropriately partitioned. - Robert On Sat, Jun 24, 2017 at 5:08 AM, Morand, Sebastien <[email protected]> wrote: > Hi Ahmet, > > Yes I know, it's exactly what I said in message: streaming is NOT supported > in python yet. > > My question is: is there any way to timestamp the bounded sources and use > the Windowing in BATCH mode? > > Regards, > > > Sébastien MORAND > Team Lead Solution Architect > Technology & Operations / Digital Factory > Veolia - Group Information Systems & Technology (IS&T) > Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08 > Bureau 0144C (Ouest) > 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France > www.veolia.com > > > > On 23 June 2017 at 21:32, Ahmet Altay <[email protected]> wrote: >> >> Hi Sebastien, >> >> Streaming is not supported in the Python SDK yet. There is a steady >> progress on that and soon we will be able to offer it. It is already >> supported in Java. >> >> Thank you, >> Ahmet >> >> On Fri, Jun 23, 2017 at 12:26 PM, Morand, Sebastien >> <[email protected]> wrote: >>> >>> Ok I think I understand these concepts, so as streamline is not supported >>> in python yet, there is no way to timestamp the bounded sources, isn't >>> there? >>> >>> Is it possible in java? >>> >>> Sébastien MORAND >>> Team Lead Solution Architect >>> Technology & Operations / Digital Factory >>> Veolia - Group Information Systems & Technology (IS&T) >>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08 >>> Bureau 0144C (Ouest) >>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France >>> www.veolia.com >>> >>> >>> >>> On 23 June 2017 at 17:04, Eugene Kirpichov <[email protected]> wrote: >>>> >>>> To elaborate on what Kenn said: the main difference between batch and >>>> streaming execution modes of a typical Beam runner is how they handle >>>> GroupByKey. >>>> >>>> The next stage can start processing the KV<K, Iterable<V>> in a >>>> particular window (assuming Java) when the GBK is sure it's seen all >>>> elements with this key in this window. >>>> >>>> A typical runner in streaming mode assumes that data within a key is >>>> more or less ordered by time, with the watermark giving a bound on how out >>>> of order it's likely to be. Because of that, runners can eg declare a >>>> window >>>> of a key "complete" when the watermark reaches the end of window. More >>>> precisely, this is controlled by triggers. >>>> >>>> However, typical data sources used in batch mode (eg files, key value >>>> tables etc) are not ordered by time at all and hence data is assumed to be >>>> completely out of order and no useful watermark can be provided. So runners >>>> have only one option: process all data, and then declare all keys and all >>>> windows complete, and then start the next stage. >>>> >>>> Note that this is not a fundamental limitation of the Beam model: it's >>>> conceivable to have a runner that reads a bounded dataset (eg a set of log >>>> files marked with date in the filename) with some knowledge of their time >>>> ordering and could pipeline the stages. Just current runners treat all >>>> bounded datasets as completely out of order, and the current API for custom >>>> bounded sources doesn't even have a function for reporting a watermark. >>>> >>>> Hope this helps. >>>> >>>> >>>> On Fri, Jun 23, 2017, 7:02 AM Kenneth Knowles <[email protected]> wrote: >>>>> >>>>> Hello! >>>>> >>>>> The behavior you are seeing is what makes something batch mode >>>>> processing. The essential definition of streaming mode processing is that >>>>> you get output before you have processed all the data. >>>>> >>>>> Event time windowing does not control when computations occur - when >>>>> you window into FixedWindows of five seconds, this means your data will be >>>>> grouped according to the window that contains the timestamp on the event. >>>>> In >>>>> streaming mode, this grouping will generally be output soon after the >>>>> watermark exceeds the end of the window, but this can be customized using >>>>> triggers. >>>>> >>>>> Kenn >>>>> >>>>> On Thu, Jun 22, 2017 at 10:13 AM, Morand, Sebastien >>>>> <[email protected]> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> I'm trying to window a batch, but whatever I try, the timestamp is not >>>>>> working, it's blocked by the group by : >>>>>> >>>>>> >>>>>> I want the transform-comine (the last on my screenshot) starts before >>>>>> the group_source has read all the files in input, and it's never >>>>>> happening. >>>>>> First it's reading all my files in the group by, and when it's over, it >>>>>> starts the transform-combine ... >>>>>> >>>>>> I put a FixedWindow with 5 seconds, doesn't change anything, any way >>>>>> to do so? >>>>>> >>>>>> NB, JOB ID : 2017-06-22_10_03_47-4479358064592021427 >>>>>> >>>>>> thanks by advance >>>>>> >>>>>> Sébastien MORAND >>>>>> Team Lead Solution Architect >>>>>> Technology & Operations / Digital Factory >>>>>> Veolia - Group Information Systems & Technology (IS&T) >>>>>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08 >>>>>> Bureau 0144C (Ouest) >>>>>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France >>>>>> www.veolia.com >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------------------------- >>>>>> This e-mail transmission (message and any attached files) may contain >>>>>> information that is proprietary, privileged and/or confidential to Veolia >>>>>> Environnement and/or its affiliates and is intended exclusively for the >>>>>> person(s) to whom it is addressed. If you are not the intended recipient, >>>>>> please notify the sender by return e-mail and delete all copies of this >>>>>> e-mail, including all attachments. Unless expressly authorized, any use, >>>>>> disclosure, publication, retransmission or dissemination of this e-mail >>>>>> and/or of its attachments is strictly prohibited. >>>>>> >>>>>> Ce message electronique et ses fichiers attaches sont strictement >>>>>> confidentiels et peuvent contenir des elements dont Veolia Environnement >>>>>> et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc >>>>>> destines a l'usage de leurs seuls destinataires. Si vous avez recu ce >>>>>> message par erreur, merci de le retourner a son emetteur et de le >>>>>> detruire >>>>>> ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la >>>>>> publication, la distribution, ou la reproduction non expressement >>>>>> autorisees >>>>>> de ce message et de ses pieces attachees sont interdites. >>>>>> >>>>>> -------------------------------------------------------------------------------------------- >>>>> >>>>> >>> >>> >>> >>> >>> -------------------------------------------------------------------------------------------- >>> This e-mail transmission (message and any attached files) may contain >>> information that is proprietary, privileged and/or confidential to Veolia >>> Environnement and/or its affiliates and is intended exclusively for the >>> person(s) to whom it is addressed. If you are not the intended recipient, >>> please notify the sender by return e-mail and delete all copies of this >>> e-mail, including all attachments. Unless expressly authorized, any use, >>> disclosure, publication, retransmission or dissemination of this e-mail >>> and/or of its attachments is strictly prohibited. >>> >>> Ce message electronique et ses fichiers attaches sont strictement >>> confidentiels et peuvent contenir des elements dont Veolia Environnement >>> et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc >>> destines a l'usage de leurs seuls destinataires. Si vous avez recu ce >>> message par erreur, merci de le retourner a son emetteur et de le detruire >>> ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la >>> publication, la distribution, ou la reproduction non expressement autorisees >>> de ce message et de ses pieces attachees sont interdites. >>> >>> -------------------------------------------------------------------------------------------- >> >> > > > > -------------------------------------------------------------------------------------------- > This e-mail transmission (message and any attached files) may contain > information that is proprietary, privileged and/or confidential to Veolia > Environnement and/or its affiliates and is intended exclusively for the > person(s) to whom it is addressed. If you are not the intended recipient, > please notify the sender by return e-mail and delete all copies of this > e-mail, including all attachments. Unless expressly authorized, any use, > disclosure, publication, retransmission or dissemination of this e-mail > and/or of its attachments is strictly prohibited. > > Ce message electronique et ses fichiers attaches sont strictement > confidentiels et peuvent contenir des elements dont Veolia Environnement > et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc > destines a l'usage de leurs seuls destinataires. Si vous avez recu ce > message par erreur, merci de le retourner a son emetteur et de le detruire > ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la > publication, la distribution, ou la reproduction non expressement autorisees > de ce message et de ses pieces attachees sont interdites. > --------------------------------------------------------------------------------------------
